Data Information
The data set we have chosen are related to red and white variants of the Portuguese “Vinho Verde” wine. There
are a total of 12 inputs variables obtained using physicochemical tests in the data set, which are: 1 - fixed
acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
Research Question
Can we predict with good accuracy the quality of wine for both the red and the white varieties by considering
their chemical properties, and identify the most significant features that predict a good red wine or white
wine?
Research Approach
To answer the research question, we will undertake the following approach:
This is done by:
* Representing each variable as a boxplot
* Scatterplot matrix
good if it is above the median value of 6, and the wine as bad
if it is equal to or lower than 6. Since this is a binary response variable, there are three suitable
classifiers we have learnt which are:Thus, we will build each of the classifer, evaluate their performance by looking at their ROC plots, as well as their AUC values, and then decide on which classifier is best suited to predict wine quality.
data_redwine <- read.csv("winequality-red.csv")
data_whitewine <- read.csv("winequality-white.csv")
First, we need to see if the red and the white wine data are significantly differnt from each other. To this end, we utilize two methods: k-means clustering and Principal Means Analysis (PCA)
Before that, we temporarily label the data as red or wine by adding an additional column to the datasets.
data_redwine$label <- rep("red", nrow(data_redwine))
data_whitewine$label <- rep("white", nrow(data_whitewine))
total <- rbind(data_redwine, data_whitewine)
total <- na.omit(total)
Using k-means clustering to attempt to classify the total wine dataset
kmdata <- as.matrix(total[c("fixed.acidity", "volatile.acidity", "citric.acid",
"residual.sugar", "chlorides", "free.sulfur.dioxide", "total.sulfur.dioxide", "density",
"pH", "sulphates", "alcohol")])
km <- kmeans(kmdata, 2, nstart = 25)
Because the dataset has 11 predictors, there are \(^11C_2 = 55\) ways to plot pairs of predictors on a 2D-plane. We only select a few to plot below:
df = as.data.frame(kmdata)
df$cluster = factor(km$cluster)
centers = as.data.frame(km$centers)
g1 <- ggplot(data = df, aes(x = fixed.acidity, y = volatile.acidity, color = cluster)) +
geom_point() + theme(legend.position = "right") +
geom_point(data = centers, aes(x = fixed.acidity, y = volatile.acidity, color = as.factor(c(1,2))), size = 10, alpha = 0.3, show.legend = FALSE)
g2 <- ggplot(data = df, aes(x = citric.acid, y = residual.sugar, color = cluster)) +
geom_point() + theme(legend.position = "right") +
geom_point(data = centers, aes(x = citric.acid, y = residual.sugar, color = as.factor(c(1,2))), size = 10, alpha = 0.3, show.legend = FALSE)
g3 <- ggplot(data = df, aes(x = chlorides, y = free.sulfur.dioxide, color = cluster)) +
geom_point() + theme(legend.position = "right") +
geom_point(data = centers, aes(x = chlorides, y = free.sulfur.dioxide, color = as.factor(c(1,2))), size = 10, alpha = 0.3, show.legend = FALSE)
g4 <- ggplot(data = df, aes(x = total.sulfur.dioxide, y = density, color = cluster)) +
geom_point() + theme(legend.position = "right") +
geom_point(data = centers, aes(x = total.sulfur.dioxide, y = density, color = as.factor(c(1,2))), size = 10, alpha = 0.3, show.legend = FALSE)
g5 <- ggplot(data = df, aes(x = pH, y = sulphates, color = cluster)) +
geom_point() + theme(legend.position = "right") +
geom_point(data = centers, aes(x = pH, y = sulphates, color = as.factor(c(1,2))), size = 10, alpha = 0.3, show.legend = FALSE)
grid.arrange(g1, g2, g3, g4, g5, nrow = 5)
We see that several of the plots don’t do a good job of separating the red and the white wines. In fact, only one of the above plots – density vs. total sulfur dioxide – shows some distinction between the two kinds of wine. We need a different method.
Using Principal Components Analysis (PCA) for clustering the two wines
We run a PCA on the combined dataset as follows:
source('https://raw.githubusercontent.com/vqv/ggbiplot/master/R/ggbiplot.r')
source('https://raw.githubusercontent.com/vqv/ggbiplot/master/R/ggscreeplot.r')
pca_total <- prcomp(total[,c(1:11)], scale. = TRUE, center = TRUE)
ggbiplot(pca_total, obs.scale = 1, var.scale = 1, groups = total$label, ellipse = TRUE) +
scale_color_discrete(name = '') +
theme(legend.direction = 'horizontal', legend.position = 'top')
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
PC1 vs. PC3:
ggbiplot(pca_total, obs.scale = 1, var.scale = 1, choices = c(1,3), groups = total$label, ellipse = TRUE) +
scale_color_discrete(name = '') +
theme(legend.direction = 'horizontal', legend.position = 'top')
PC1 vs. PC4:
ggbiplot(pca_total, obs.scale = 1, var.scale = 1, choices = c(1,4), groups = total$label, ellipse = TRUE) +
scale_color_discrete(name = '') +
theme(legend.direction = 'horizontal', legend.position = 'top')
As we can see, several of the Principal Components do a decent job of separating out the red wines and the white wines. This motivates us to analyze each type of wine separately for the rest of our project.
Exploring Correlation Between Variables
Correlated predictors are the bane of data analysis. Most of our classification methods require that the predictors in the model be as uncorrelated as possible. To that end, we use the strategy of removing one predictor variable from every pair of highly correlated variables in the dataset.
corr_red <- cor(data_redwine[,c(1:11)])
corr_white <- cor(data_whitewine[, c(1:11)])
Correlation graphs/Heat-maps of the predictors for the Red wine dataset:
library('gplots')
library('PerformanceAnalytics')
red_map <- heatmap.2(x = corr_red, col = rev(heat.colors(16)), symm = TRUE, dendrogram = 'none',
trace = 'none', main = "Correlation Matrix for Red Wines", key = TRUE,
lmat=rbind( c(0, 3), c(2,1), c(0,4) ), lhei=c(1.5,4,1.5), lwid = c(0.5,3), srtRow = 10, srtCol = 20, symkey = TRUE, key.title = NA,
tracecol = 'blue')
chart.Correlation(corr_red, histogram = TRUE, pch = 1)
And for White wines:
white_map <- heatmap.2(x = corr_white, col = rev(heat.colors(16)), symm = TRUE, dendrogram = 'none',
trace = 'none', main = "Correlation Matrix for White Wines", key = TRUE,
lmat=rbind( c(0, 3), c(2,1), c(0,4) ), lhei=c(1.5,4,1.5), lwid = c(0.5,3), srtRow = 10, srtCol = 20, symkey = TRUE, key.title = NA,
tracecol = 'blue')
chart.Correlation(corr_white, histogram = TRUE, pch = 1)
Some predictors are highly correlated in both datasets (>50%). Moreover, some of them are not the same pairs across both datasets. We use the strategy described above and remove the correlated variables that have a single red asterisk (*) or more in the correlation charts above.
This leads us to a filtered set of less correlated predictors to use for the rest of our analyses:
# Retained variables for red
# citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + sulphates + alcohol
vars_red <- c("citric.acid", "residual.sugar", "chlorides", "free.sulfur.dioxide", "sulphates", "alcohol", "quality")
data_redwine_reduced <- data_redwine[vars_red]
# Retained variables for white
# fixed.acidity + volatile.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + sulphates
vars_white <- c("fixed.acidity", "volatile.acidity", "citric.acid", "residual.sugar", "chlorides", "free.sulfur.dioxide", "sulphates", "quality")
data_whitewine_reduced <- data_whitewine[vars_white]
The new correlation heatmaps
First let’s create a correlation matrix
corr_red_new <- cor(data_redwine_reduced)
corr_white_new <- cor(data_whitewine_reduced)
The new correlation heat map for red wines:
red_map_new <- heatmap.2(x = corr_red_new, col = rev(heat.colors(16)), symm = TRUE, dendrogram = 'none',
trace = 'none', main = "Correlation Matrix for Red Wines (New)", key = TRUE,
lmat=rbind( c(0, 3), c(2,1), c(0,4) ), lhei=c(1.5,4,1.5), lwid = c(0.5,3), srtRow = 10, srtCol = 20, symkey = TRUE, key.title = NA,
tracecol = 'blue')
Similarly for white wines:
white_map_new <- heatmap.2(x = corr_white_new, col = rev(heat.colors(16)), symm = TRUE, dendrogram = 'none',
trace = 'none', main = "Correlation Matrix for White Wines (New)", key = TRUE,
lmat=rbind( c(0, 3), c(2,1), c(0,4) ), lhei=c(1.5,4,1.5), lwid = c(0.5,3), srtRow = 10, srtCol = 20, symkey = TRUE, key.title = NA,
tracecol = 'blue')
For all classifiers, the aim is to predict whether the wine is above median quality or otherwise, hence the
response variable would be the binary variable whether the wine is “good” or “bad”. We can prepare our data for
all three classifiers by adding the labels good or bad to the dataset.
Furthermore, we also use the reduced dataset with correlated variables removed as per the EDA above, as all three classifers do not handle (or do not handle well) correlated variables. We have still provided the classifiers with the full dataset in the appendix below.
All classifers also share the same partitioning of 80/20 for the training data and test data split.
Assign Labels
# assign labels for red wine (full dataset)
data_red_wine_labelled <- mutate(data_redwine,
label = case_when(
quality > 6 ~ "good",
quality <= 6 ~ "bad"
))
# assign labels for white wine (full dataset)
data_white_wine_labelled <- mutate(data_whitewine,
label = case_when(
quality > 6 ~ "good",
quality <= 6 ~ "bad"
))
# assign labels for red wine (reduced dataset)
data_redwine_reduced_labelled <- mutate(data_redwine_reduced,
label = case_when(
quality > 6 ~ "good",
quality <= 6 ~ "bad"
))
# assign labels for white wine (reduced dataset)
data_whitewine_reduced_labelled <- mutate(data_whitewine_reduced,
label = case_when(
quality > 6 ~ "good",
quality <= 6 ~ "bad"
))
data_red_wine_labelled$label_binary <- as.numeric((data_red_wine_labelled$quality >6 ))
data_white_wine_labelled$label_binary <- as.numeric((data_white_wine_labelled$quality >6 ))
data_redwine_reduced_labelled$label_binary <- as.numeric((data_redwine_reduced_labelled$quality >6 ))
data_whitewine_reduced_labelled$label_binary <- as.numeric((data_whitewine_reduced_labelled$quality >6 ))
Data Partitioning To compare the classifiers fairly, we will use the same partitioned training (80%) and testing (20%) data for all classifiers.
# Partition red wine full dataset
set.seed(100000)
indxTrain_red_full <- createDataPartition(y = data_red_wine_labelled$label ,p = 0.8,list = FALSE)
traindata_red_full <- data_red_wine_labelled[indxTrain_red_full, !(names(data_red_wine_labelled) %in% c('quality'))]
testdata_red_full <- data_red_wine_labelled[-indxTrain_red_full, !(names(data_red_wine_labelled) %in% c('quality'))]
# Partition white wine full dataset
set.seed(100000)
indxTrain_white_full <- createDataPartition(y = data_white_wine_labelled$label ,p = 0.8,list = FALSE)
traindata_white_full <- data_white_wine_labelled[indxTrain_white_full, !(names(data_white_wine_labelled) %in% c('quality'))]
testdata_white_full <- data_white_wine_labelled[-indxTrain_white_full, !(names(data_white_wine_labelled) %in% c('quality'))]
# Partition red wine reduced dataset
set.seed(100000)
indxTrain_red_reduced <- createDataPartition(y = data_redwine_reduced_labelled$label ,p = 0.8,list = FALSE)
traindata_red_reduced <- data_redwine_reduced_labelled[indxTrain_red_reduced, !(names(data_redwine_reduced_labelled) %in% c('quality'))]
testdata_red_reduced <- data_redwine_reduced_labelled[-indxTrain_red_reduced, !(names(data_redwine_reduced_labelled) %in% c('quality'))]
# Partition white wine reduced dataset
set.seed(100000)
indxTrain_white_reduced <- createDataPartition(y = data_whitewine_reduced_labelled$label ,p = 0.8,list = FALSE) # uses random sampling
traindata_white_reduced <- data_whitewine_reduced_labelled[indxTrain_white_reduced, !(names(data_whitewine_reduced_labelled) %in% c('quality'))]
testdata_white_reduced <- data_whitewine_reduced_labelled[-indxTrain_white_reduced, !(names(data_whitewine_reduced_labelled) %in% c('quality'))]
As mentioned with in the general approach to the classifiers, the Naive Bayes classifier we train uses the reduced dataset with the correlated variables removed. This is because Naïve Bayes assumes the variables in the data are conditionally independent. Therefore, it is sensitive to correlated variables because the algorithm may double count the effects.
Red Wine (Reduced Dataset)
nb_model_red_reduced <- naiveBayes(as.factor(label) ~ . - label_binary, traindata_red_reduced, laplace=.01)
nb_prediction_red_reduced <- predict(nb_model_red_reduced,
# remove column "label"
testdata_red_reduced[,!(names(data_redwine_reduced_labelled) %in% c('label','label_binary'))],
type='raw')
nb_score_red_reduced <- nb_prediction_red_reduced[, c("good")]
actual_class_red_reduced <- testdata_red_reduced$label == 'good'
nb_pred_red_reduced <- prediction(nb_score_red_reduced, actual_class_red_reduced)
nb_perf_red_reduced <- performance(nb_pred_red_reduced, "tpr", "fpr")
plot(nb_perf_red_reduced, main="Red Wine Quality Label NB Model Prediction Perf", col="red", lwd=2, xlab="False Positive Rate (FPR)", ylab="True Positive Rate (TPR)")
abline(a=0, b=1, col="gray50", lty=3)
nb_auc_red_reduced <- performance(nb_pred_red_reduced, "auc")
nb_auc_red_reduced <- unlist(slot(nb_auc_red_reduced, "y.values"))
nb_auc_red_reduced
## [1] 0.8724301
White Wine (Reduced Dataset)
### White Wine NB Analysis
nb_model_white_reduced <- naiveBayes(as.factor(label) ~ . - label_binary , traindata_white_reduced, laplace=.01)
nb_prediction_white_reduced <- predict(nb_model_white_reduced,
# remove column "label" and "label_binary"
testdata_white_reduced[,!(names(data_whitewine_reduced_labelled) %in% c('label', 'label_binary'))],
type='raw')
nb_score_white_reduced <- nb_prediction_white_reduced[, c("good")]
actual_class_white_reduced <- testdata_white_reduced$label == 'good'
nb_pred_white_reduced <- prediction(nb_score_white_reduced, actual_class_white_reduced)
nb_perf_white_reduced <- performance(nb_pred_white_reduced, "tpr", "fpr")
plot(nb_perf_white_reduced, col="blue", lwd=2, main="White Wine Quality Label NB Model Prediction Perf", xlab="False Positive Rate (FPR)", ylab="True Positive Rate (TPR)")
abline(a=0, b=1, col="gray50", lty=3)
nb_auc_white_reduced <- performance(nb_pred_white_reduced, "auc")
nb_auc_white_reduced <- unlist(slot(nb_auc_white_reduced, "y.values"))
nb_auc_white_reduced
## [1] 0.7165445
To compare the performance between the red and white wines for the Naive Bayes Classifer, we can also plot them in a side-by-side plot, or plot both the AOC curves in one plot:
Side-by-side Plot
par(mfrow=c(1,2))
plot(nb_perf_red_reduced, col="red", lwd=2, main="Red Wine Quality Label NB Model Prediction Perf", xlab="False Positive Rate (FPR)", ylab="True Positive Rate (TPR)")
abline(a=0, b=1, col="gray50", lty=3)
plot(nb_perf_white_reduced, col="blue", lwd=2, main="White Wine Quality Label NB Model Prediction Perf", xlab="False Positive Rate (FPR)", ylab="True Positive Rate (TPR)")
abline(a=0, b=1, col="gray50", lty=3)
Multiple lines in single plot
par(mfrow=c(1,1))
plot(nb_perf_red_reduced, col="red", lwd=2, main="Wine Quality Label NB Model Prediction Perf (Reduced Dataset)", xlab="False Positive Rate (FPR)", ylab="True Positive Rate (TPR)")
abline(a=0, b=1, col="gray50", lty=3)
# par(new = TRUE) # We draw on top of the existing plot.
# plot(perf, col = "red", lwd=2, main="", xlab="", ylab="")
par(new = TRUE) # We draw on top of the existing plot.
plot(nb_perf_white_reduced, col = "blue", lwd=2, main="", xlab="", ylab="")
legend("topleft",
legend = c("Red Wine", "White Wine"),
col = c("red", "blue"),
pch = c(0, 0, 0))
Red Wine (Reduced Dataset)
red_lr_train_reduced <- glm(formula = label_binary ~
citric.acid
+ residual.sugar + chlorides + free.sulfur.dioxide
+ sulphates + alcohol, data = traindata_red_reduced,
family = binomial(link="logit"))
summary(red_lr_train_reduced)
##
## Call:
## glm(formula = label_binary ~ citric.acid + residual.sugar + chlorides +
## free.sulfur.dioxide + sulphates + alcohol, family = binomial(link = "logit"),
## data = traindata_red_reduced)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6603 -0.4518 -0.2455 -0.1610 2.9311
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -14.784556 1.249219 -11.835 < 2e-16 ***
## citric.acid 2.893960 0.528193 5.479 4.28e-08 ***
## residual.sugar 0.029863 0.069530 0.429 0.66756
## chlorides -12.057117 3.775605 -3.193 0.00141 **
## free.sulfur.dioxide -0.017101 0.009844 -1.737 0.08237 .
## sulphates 3.473044 0.576068 6.029 1.65e-09 ***
## alcohol 0.993624 0.093853 10.587 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1017.7 on 1279 degrees of freedom
## Residual deviance: 732.2 on 1273 degrees of freedom
## AIC: 746.2
##
## Number of Fisher Scoring iterations: 6
pred = predict.glm(red_lr_train_reduced, testdata_red_reduced, type="response")
predObj <- prediction(pred, testdata_red_reduced$label_binary)
rocObj = performance(predObj, measure="tpr", x.measure="fpr")
aucObj = performance(predObj, measure="auc")
plot(rocObj, main = paste("Area under the curve:",
round(aucObj@y.values[[1]] ,4)))
lr_perf_red_reduced <- rocObj
lr_auc_red_reduced <- round(aucObj@y.values[[1]] ,4)
# Select only numeric predictors
mydata <- data_redwine_reduced_labelled %>%
dplyr::select_if(is.numeric)
predictors <- colnames(mydata)
probabilities <- predict(red_lr_train_reduced, newdata = data_redwine_reduced_labelled ,type = "response")
# Bind the logit and tidying the data for plot
mydata <- mydata %>%
mutate(logit = log(probabilities/(1-probabilities))) %>%
gather(key = "predictors", value = "predictor.value", -logit)
ggplot(mydata, aes(logit, predictor.value))+
geom_point(size = 0.5, alpha = 0.5) +
geom_smooth(method = "loess") +
theme_bw() +
facet_wrap(~predictors, scales = "free_y")
## `geom_smooth()` using formula 'y ~ x'
Whitw Wine (Reduced Dataset)
white_lr_train_reduced <- glm(formula = label_binary ~
fixed.acidity + volatile.acidity + citric.acid
+ residual.sugar + chlorides + free.sulfur.dioxide
+ sulphates, data = traindata_white_reduced,
family = binomial(link="logit"))
summary(white_lr_train_reduced)
##
## Call:
## glm(formula = label_binary ~ fixed.acidity + volatile.acidity +
## citric.acid + residual.sugar + chlorides + free.sulfur.dioxide +
## sulphates, family = binomial(link = "logit"), data = traindata_white_reduced)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4520 -0.7402 -0.5784 -0.2340 3.5343
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.048571 0.426369 4.805 1.55e-06 ***
## fixed.acidity -0.129183 0.050037 -2.582 0.009830 **
## volatile.acidity -1.652724 0.446909 -3.698 0.000217 ***
## citric.acid -0.349132 0.383067 -0.911 0.362078
## residual.sugar -0.027042 0.009325 -2.900 0.003732 **
## chlorides -54.943570 4.393521 -12.506 < 2e-16 ***
## free.sulfur.dioxide 0.003434 0.002461 1.396 0.162836
## sulphates 0.909357 0.338748 2.684 0.007265 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4093.7 on 3918 degrees of freedom
## Residual deviance: 3789.0 on 3911 degrees of freedom
## AIC: 3805
##
## Number of Fisher Scoring iterations: 5
pred = predict.glm(white_lr_train_reduced, testdata_white_reduced, type="response")
predObj <- prediction(pred, testdata_white_reduced$label_binary)
rocObj = performance(predObj, measure="tpr", x.measure="fpr")
aucObj = performance(predObj, measure="auc")
plot(rocObj, main = paste("Area under the curve:",
round(aucObj@y.values[[1]] ,4)))
lr_perf_white_reduced <- rocObj
lr_auc_white_reduced <- round(aucObj@y.values[[1]] ,4)
# Select only numeric predictors
mydata <- data_whitewine_reduced_labelled %>%
dplyr::select_if(is.numeric)
predictors <- colnames(mydata)
probabilities <- predict(white_lr_train_reduced, newdata = data_whitewine_reduced_labelled ,type = "response")
# Bind the logit and tidying the data for plot
mydata <- mydata %>%
mutate(logit = log(probabilities/(1-probabilities))) %>%
gather(key = "predictors", value = "predictor.value", -logit)
ggplot(mydata, aes(logit, predictor.value))+
geom_point(size = 0.5, alpha = 0.5) +
geom_smooth(method = "loess") +
theme_bw() +
facet_wrap(~predictors, scales = "free_y")
## `geom_smooth()` using formula 'y ~ x'
Red Wine (Reduced Dataset)
# Red wine (reduced dataset)
dt_fit_red_reduced <- rpart(label ~ citric.acid
+ residual.sugar + chlorides + free.sulfur.dioxide
+ sulphates + alcohol, data=traindata_red_reduced,control=rpart.control(minsplit=128), method="class",
parms=list(split='information'))
summary(dt_fit_red_reduced)
## Call:
## rpart(formula = label ~ citric.acid + residual.sugar + chlorides +
## free.sulfur.dioxide + sulphates + alcohol, data = traindata_red_reduced,
## method = "class", parms = list(split = "information"), control = rpart.control(minsplit = 128))
## n= 1280
##
## CP nsplit rel error xerror xstd
## 1 0.03639847 0 1.0000000 1.000000 0.07046897
## 2 0.01000000 3 0.8908046 1.063218 0.07230007
##
## Variable importance
## alcohol sulphates chlorides citric.acid
## 57 21 11 6
## residual.sugar free.sulfur.dioxide
## 2 2
##
## Node number 1: 1280 observations, complexity param=0.03639847
## predicted class=bad expected loss=0.1359375 P(node) =1
## class counts: 1106 174
## probabilities: 0.864 0.136
## left son=2 (739 obs) right son=3 (541 obs)
## Primary splits:
## alcohol < 10.45 to the left, improve=83.585570, (0 missing)
## sulphates < 0.685 to the left, improve=47.384780, (0 missing)
## citric.acid < 0.315 to the left, improve=46.525590, (0 missing)
## chlorides < 0.0675 to the right, improve=21.069190, (0 missing)
## free.sulfur.dioxide < 13.5 to the right, improve= 7.970615, (0 missing)
## Surrogate splits:
## chlorides < 0.0685 to the right, agree=0.662, adj=0.201, (0 split)
## sulphates < 0.675 to the left, agree=0.636, adj=0.139, (0 split)
## citric.acid < 0.315 to the left, agree=0.624, adj=0.111, (0 split)
## residual.sugar < 4.15 to the left, agree=0.592, adj=0.035, (0 split)
## free.sulfur.dioxide < 3.5 to the right, agree=0.586, adj=0.020, (0 split)
##
## Node number 2: 739 observations
## predicted class=bad expected loss=0.03247632 P(node) =0.5773438
## class counts: 715 24
## probabilities: 0.968 0.032
##
## Node number 3: 541 observations, complexity param=0.03639847
## predicted class=bad expected loss=0.2772643 P(node) =0.4226563
## class counts: 391 150
## probabilities: 0.723 0.277
## left son=6 (363 obs) right son=7 (178 obs)
## Primary splits:
## sulphates < 0.735 to the left, improve=22.797060, (0 missing)
## citric.acid < 0.275 to the left, improve=21.836480, (0 missing)
## alcohol < 11.55 to the left, improve=20.390010, (0 missing)
## chlorides < 0.0675 to the right, improve= 4.376036, (0 missing)
## free.sulfur.dioxide < 13.5 to the right, improve= 4.143684, (0 missing)
## Surrogate splits:
## alcohol < 13.53333 to the left, agree=0.688, adj=0.051, (0 split)
## free.sulfur.dioxide < 36.5 to the left, agree=0.684, adj=0.039, (0 split)
## chlorides < 0.191 to the left, agree=0.677, adj=0.017, (0 split)
##
## Node number 6: 363 observations
## predicted class=bad expected loss=0.184573 P(node) =0.2835937
## class counts: 296 67
## probabilities: 0.815 0.185
##
## Node number 7: 178 observations, complexity param=0.03639847
## predicted class=bad expected loss=0.4662921 P(node) =0.1390625
## class counts: 95 83
## probabilities: 0.534 0.466
## left son=14 (113 obs) right son=15 (65 obs)
## Primary splits:
## alcohol < 11.65 to the left, improve=6.7203070, (0 missing)
## chlorides < 0.0635 to the right, improve=3.3726720, (0 missing)
## citric.acid < 0.28 to the left, improve=2.8359240, (0 missing)
## free.sulfur.dioxide < 21.5 to the right, improve=2.3676450, (0 missing)
## sulphates < 0.795 to the left, improve=0.2092203, (0 missing)
## Surrogate splits:
## chlorides < 0.053 to the right, agree=0.685, adj=0.138, (0 split)
## residual.sugar < 6.15 to the left, agree=0.669, adj=0.092, (0 split)
## citric.acid < 0.015 to the right, agree=0.663, adj=0.077, (0 split)
##
## Node number 14: 113 observations
## predicted class=bad expected loss=0.3628319 P(node) =0.08828125
## class counts: 72 41
## probabilities: 0.637 0.363
##
## Node number 15: 65 observations
## predicted class=good expected loss=0.3538462 P(node) =0.05078125
## class counts: 23 42
## probabilities: 0.354 0.646
rpart.plot(dt_fit_red_reduced, type=4, extra=1)
dt_prediction_red_reduced <- predict(dt_fit_red_reduced,
testdata_red_reduced[, !(names(data_redwine_reduced_labelled) %in% c('label', 'label_binary'))],
type='prob')
dt_pred_red_reduced <- prediction(dt_prediction_red_reduced[, 2], testdata_red_reduced$label)
dt_perf_red_reduced <- performance(dt_pred_red_reduced, "tpr", "fpr")
plot(dt_perf_red_reduced, col="red", lwd=2, xlab="False Positive Rate (FPR)", ylab="True Positive Rate (TPR)")
abline(a=0, b=1, col="gray50", lty=3)
dt_auc_red_reduced <- performance(dt_pred_red_reduced, "auc")
dt_auc_red_reduced <- unlist(slot(dt_auc_red_reduced,"y.values"))
dt_auc_red_reduced
## [1] 0.8092349
White Wine (Reduced Dataset)
# White wine (reduced dataset)
dt_fit_white_reduced <- rpart(label ~ fixed.acidity + volatile.acidity + citric.acid
+ residual.sugar + chlorides + free.sulfur.dioxide
+ sulphates, data=traindata_white_reduced,control=rpart.control(minsplit=390, cp = 0.0085), method="class",
parms=list(split='information'))
summary(dt_fit_white_reduced)
## Call:
## rpart(formula = label ~ fixed.acidity + volatile.acidity + citric.acid +
## residual.sugar + chlorides + free.sulfur.dioxide + sulphates,
## data = traindata_white_reduced, method = "class", parms = list(split = "information"),
## control = rpart.control(minsplit = 390, cp = 0.0085))
## n= 3919
##
## CP nsplit rel error xerror xstd
## 1 0.008726415 0 1.0000000 1.0000000 0.03039865
## 2 0.008500000 6 0.9339623 0.9834906 0.03021531
##
## Variable importance
## chlorides residual.sugar free.sulfur.dioxide volatile.acidity
## 58 21 7 4
## fixed.acidity sulphates citric.acid
## 4 3 3
##
## Node number 1: 3919 observations, complexity param=0.008726415
## predicted class=bad expected loss=0.2163817 P(node) =1
## class counts: 3071 848
## probabilities: 0.784 0.216
## left son=2 (2412 obs) right son=3 (1507 obs)
## Primary splits:
## chlorides < 0.0395 to the right, improve=127.94500, (0 missing)
## residual.sugar < 6.35 to the right, improve= 36.16577, (0 missing)
## citric.acid < 0.235 to the left, improve= 35.78384, (0 missing)
## free.sulfur.dioxide < 11.75 to the left, improve= 22.52309, (0 missing)
## volatile.acidity < 0.195 to the right, improve= 22.29511, (0 missing)
## Surrogate splits:
## sulphates < 0.365 to the right, agree=0.635, adj=0.050, (0 split)
## fixed.acidity < 5.65 to the right, agree=0.634, adj=0.049, (0 split)
## residual.sugar < 0.975 to the right, agree=0.619, adj=0.008, (0 split)
## free.sulfur.dioxide < 5.5 to the right, agree=0.618, adj=0.007, (0 split)
## volatile.acidity < 0.095 to the right, agree=0.616, adj=0.001, (0 split)
##
## Node number 2: 2412 observations
## predicted class=bad expected loss=0.1322554 P(node) =0.6154631
## class counts: 2093 319
## probabilities: 0.868 0.132
##
## Node number 3: 1507 observations, complexity param=0.008726415
## predicted class=bad expected loss=0.3510285 P(node) =0.3845369
## class counts: 978 529
## probabilities: 0.649 0.351
## left son=6 (366 obs) right son=7 (1141 obs)
## Primary splits:
## residual.sugar < 7.775 to the right, improve=20.515640, (0 missing)
## free.sulfur.dioxide < 23.5 to the left, improve=15.660440, (0 missing)
## fixed.acidity < 7.15 to the right, improve=10.737170, (0 missing)
## sulphates < 0.565 to the left, improve= 9.819161, (0 missing)
## citric.acid < 0.395 to the right, improve= 8.357649, (0 missing)
## Surrogate splits:
## citric.acid < 0.535 to the right, agree=0.763, adj=0.025, (0 split)
## free.sulfur.dioxide < 52.5 to the right, agree=0.759, adj=0.008, (0 split)
## fixed.acidity < 9.75 to the right, agree=0.758, adj=0.005, (0 split)
##
## Node number 6: 366 observations
## predicted class=bad expected loss=0.215847 P(node) =0.09339117
## class counts: 287 79
## probabilities: 0.784 0.216
##
## Node number 7: 1141 observations, complexity param=0.008726415
## predicted class=bad expected loss=0.3943909 P(node) =0.2911457
## class counts: 691 450
## probabilities: 0.606 0.394
## left son=14 (204 obs) right son=15 (937 obs)
## Primary splits:
## residual.sugar < 1.225 to the left, improve=24.509670, (0 missing)
## free.sulfur.dioxide < 23.5 to the left, improve=19.304660, (0 missing)
## sulphates < 0.565 to the left, improve= 7.436217, (0 missing)
## volatile.acidity < 0.3175 to the left, improve= 7.004484, (0 missing)
## fixed.acidity < 7.35 to the right, improve= 6.172906, (0 missing)
## Surrogate splits:
## fixed.acidity < 9.05 to the right, agree=0.824, adj=0.015, (0 split)
##
## Node number 14: 204 observations
## predicted class=bad expected loss=0.1862745 P(node) =0.0520541
## class counts: 166 38
## probabilities: 0.814 0.186
##
## Node number 15: 937 observations, complexity param=0.008726415
## predicted class=bad expected loss=0.4397012 P(node) =0.2390916
## class counts: 525 412
## probabilities: 0.560 0.440
## left son=30 (329 obs) right son=31 (608 obs)
## Primary splits:
## free.sulfur.dioxide < 24.5 to the left, improve=13.702400, (0 missing)
## volatile.acidity < 0.3175 to the left, improve= 6.130320, (0 missing)
## fixed.acidity < 7.35 to the right, improve= 5.952005, (0 missing)
## residual.sugar < 1.825 to the left, improve= 5.748669, (0 missing)
## citric.acid < 0.395 to the right, improve= 5.701163, (0 missing)
## Surrogate splits:
## volatile.acidity < 0.1575 to the left, agree=0.667, adj=0.052, (0 split)
## sulphates < 0.775 to the right, agree=0.657, adj=0.024, (0 split)
## fixed.acidity < 8.45 to the right, agree=0.653, adj=0.012, (0 split)
## chlorides < 0.0235 to the left, agree=0.653, adj=0.012, (0 split)
##
## Node number 30: 329 observations
## predicted class=bad expected loss=0.325228 P(node) =0.08394999
## class counts: 222 107
## probabilities: 0.675 0.325
##
## Node number 31: 608 observations, complexity param=0.008726415
## predicted class=good expected loss=0.4983553 P(node) =0.1551416
## class counts: 303 305
## probabilities: 0.498 0.502
## left son=62 (139 obs) right son=63 (469 obs)
## Primary splits:
## citric.acid < 0.265 to the left, improve=5.911978, (0 missing)
## residual.sugar < 1.825 to the left, improve=5.400630, (0 missing)
## sulphates < 0.565 to the left, improve=5.050834, (0 missing)
## volatile.acidity < 0.315 to the left, improve=4.255048, (0 missing)
## chlorides < 0.0305 to the right, improve=1.188817, (0 missing)
## Surrogate splits:
## fixed.acidity < 5.55 to the left, agree=0.798, adj=0.115, (0 split)
## volatile.acidity < 0.495 to the right, agree=0.781, adj=0.043, (0 split)
## free.sulfur.dioxide < 64.5 to the right, agree=0.775, adj=0.014, (0 split)
##
## Node number 62: 139 observations
## predicted class=bad expected loss=0.3741007 P(node) =0.03546823
## class counts: 87 52
## probabilities: 0.626 0.374
##
## Node number 63: 469 observations, complexity param=0.008726415
## predicted class=good expected loss=0.4605544 P(node) =0.1196734
## class counts: 216 253
## probabilities: 0.461 0.539
## left son=126 (279 obs) right son=127 (190 obs)
## Primary splits:
## volatile.acidity < 0.275 to the left, improve=7.5629200, (0 missing)
## sulphates < 0.525 to the left, improve=5.4085590, (0 missing)
## citric.acid < 0.335 to the right, improve=4.8798660, (0 missing)
## residual.sugar < 1.825 to the left, improve=3.7564320, (0 missing)
## chlorides < 0.0335 to the right, improve=0.9583567, (0 missing)
## Surrogate splits:
## residual.sugar < 3.15 to the left, agree=0.627, adj=0.079, (0 split)
## fixed.acidity < 7.55 to the left, agree=0.616, adj=0.053, (0 split)
## chlorides < 0.0205 to the right, agree=0.606, adj=0.026, (0 split)
## free.sulfur.dioxide < 28.5 to the right, agree=0.606, adj=0.026, (0 split)
## citric.acid < 0.495 to the left, agree=0.603, adj=0.021, (0 split)
##
## Node number 126: 279 observations
## predicted class=bad expected loss=0.4659498 P(node) =0.07119163
## class counts: 149 130
## probabilities: 0.534 0.466
##
## Node number 127: 190 observations
## predicted class=good expected loss=0.3526316 P(node) =0.04848176
## class counts: 67 123
## probabilities: 0.353 0.647
rpart.plot(dt_fit_white_reduced, type=4, extra=1)
dt_prediction_white_reduced <- predict(dt_fit_white_reduced,
testdata_white_reduced[, !(names(data_whitewine_reduced_labelled) %in% c('label', 'label_binary'))],
type='prob')
dt_pred_white_reduced <- prediction(dt_prediction_white_reduced[, 2], testdata_white_reduced$label)
dt_perf_white_reduced <- performance(dt_pred_white_reduced, "tpr", "fpr")
plot(dt_perf_white_reduced, col="blue", lwd=2, xlab="False Positive Rate (FPR)", ylab="True Positive Rate (TPR)")
abline(a=0, b=1, col="gray50", lty=3)
dt_auc_white_reduced <- performance(dt_pred_white_reduced, "auc")
dt_auc_white_reduced <- unlist(slot(dt_auc_white_reduced,"y.values"))
dt_auc_white_reduced
## [1] 0.7139308
Evaluating Classifiers
Comparing AUC values:
print(paste0("AUC of Naive Bayes Model (Red Wine Reduced Dataset):", nb_auc_red_reduced))
## [1] "AUC of Naive Bayes Model (Red Wine Reduced Dataset):0.872430064037746"
print(paste0("AUC of Decision Tree Model (Red Wine Reduced Dataset):", dt_auc_red_reduced))
## [1] "AUC of Decision Tree Model (Red Wine Reduced Dataset):0.809234917425008"
print(paste0("AUC of Logistic Regression Model (Red Wine Reduced Dataset):", lr_auc_red_reduced))
## [1] "AUC of Logistic Regression Model (Red Wine Reduced Dataset):0.8617"
print(paste0("AUC of Naive Bayes Model (White Wine Reduced Dataset):", nb_auc_white_reduced))
## [1] "AUC of Naive Bayes Model (White Wine Reduced Dataset):0.716544488450471"
# print(paste0("AUC of Decision Tree Model (White Wine Reduced Dataset):", dt_auc_white_reduced))
print(paste0("AUC of Logistic Regression Model (White Wine Reduced Dataset):", lr_auc_white_reduced))
## [1] "AUC of Logistic Regression Model (White Wine Reduced Dataset):0.751"
Combining AOCs in single plot
par(mfrow=c(1,1))
plot(nb_perf_red_reduced, col="darkred", lwd=2, main="Wine Quality Label NB Model Prediction Perf (Full Dataset)", xlab="False Positive Rate (FPR)", ylab="True Positive Rate (TPR)")
abline(a=0, b=1, col="gray50", lty=3)
# par(new = TRUE) # We draw on top of the existing plot.
# plot(perf, col = "red", lwd=2, main="", xlab="", ylab="")
par(new = TRUE) # We draw on top of the existing plot.
plot(nb_perf_white_reduced, col = "blue", lwd=2, main="", xlab="", ylab="")
par(new = TRUE) # We draw on top of the existing plot.
plot(dt_perf_red_reduced, col = "red", lwd=2, lty=2, main="", xlab="", ylab="")
# par(new = TRUE) # We draw on top of the existing plot.
# plot(dt_perf_white_reduced, col = "blue", lwd=2, lty=2, main="", xlab="", ylab="")
par(new = TRUE) # We draw on top of the existing plot.
plot(lr_perf_red_reduced, col = "red", lwd=2, lty=3, main="", xlab="", ylab="")
par(new = TRUE) # We draw on top of the existing plot.
plot(lr_perf_white_reduced, col = "darkblue", lwd=2, lty=3, main="", xlab="", ylab="")
legend("topright",
legend = c("Red Wine (NB)", "White Wine(NB)", "Red Wine (DT)", "White Wine(DT)", "Red Wine (LR)", "White Wine(LR)"),
col = c("darkred", "darkblue", "red", "blue", "red", "blue"),
lty = c(1, 1, 2, 2, 3, 3))
Stepwise Regression
We use stepwise regression using logistic regression focusing on p-values in order to determine the significant variables that affect the quality of wine (using quality above median as the binary response variable)
Since we use logistic regression using stepwise regression, we must make sure that variables are independent from one another, i.e none of the variables are heavily correlated with each other, so we use the reduced predictor set as explained in our EDA.
We want the most significant variables, so the p-value threshold is set at 0.001. Both backward elimination and forward selection are considered.
Red Wine
data_redwine_reduced_labelled
Back Elimination
red_new_lr <- glm(formula = data_redwine_reduced_labelled$label_binary ~
data_redwine_reduced_labelled$citric.acid
+ data_redwine_reduced_labelled$residual.sugar + data_redwine_reduced_labelled$chlorides + data_redwine_reduced_labelled$free.sulfur.dioxide
+ data_redwine_reduced_labelled$sulphates + data_redwine_reduced_labelled$alcohol,
family = binomial(link="logit"))
summary(red_new_lr)
##
## Call:
## glm(formula = data_redwine_reduced_labelled$label_binary ~ data_redwine_reduced_labelled$citric.acid +
## data_redwine_reduced_labelled$residual.sugar + data_redwine_reduced_labelled$chlorides +
## data_redwine_reduced_labelled$free.sulfur.dioxide + data_redwine_reduced_labelled$sulphates +
## data_redwine_reduced_labelled$alcohol, family = binomial(link = "logit"))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5806 -0.4543 -0.2477 -0.1633 2.8974
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -14.403289 1.108754 -12.991
## data_redwine_reduced_labelled$citric.acid 2.650669 0.470346 5.636
## data_redwine_reduced_labelled$residual.sugar 0.031127 0.064625 0.482
## data_redwine_reduced_labelled$chlorides -12.443310 3.576709 -3.479
## data_redwine_reduced_labelled$free.sulfur.dioxide -0.020003 0.008916 -2.243
## data_redwine_reduced_labelled$sulphates 3.376231 0.490867 6.878
## data_redwine_reduced_labelled$alcohol 0.975752 0.083053 11.749
## Pr(>|z|)
## (Intercept) < 2e-16 ***
## data_redwine_reduced_labelled$citric.acid 1.74e-08 ***
## data_redwine_reduced_labelled$residual.sugar 0.630051
## data_redwine_reduced_labelled$chlorides 0.000503 ***
## data_redwine_reduced_labelled$free.sulfur.dioxide 0.024871 *
## data_redwine_reduced_labelled$sulphates 6.07e-12 ***
## data_redwine_reduced_labelled$alcohol < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1269.92 on 1598 degrees of freedom
## Residual deviance: 918.25 on 1592 degrees of freedom
## AIC: 932.25
##
## Number of Fisher Scoring iterations: 6
As we can see residual.sugar has the highest p-val of 0.630051, so we remove it.
red_new_lr_2 <- glm(formula = data_redwine_reduced_labelled$label_binary ~
data_redwine_reduced_labelled$citric.acid + data_redwine_reduced_labelled$chlorides + data_redwine_reduced_labelled$free.sulfur.dioxide
+ data_redwine_reduced_labelled$sulphates + data_redwine_reduced_labelled$alcohol,
family = binomial(link="logit"))
summary(red_new_lr_2)
##
## Call:
## glm(formula = data_redwine_reduced_labelled$label_binary ~ data_redwine_reduced_labelled$citric.acid +
## data_redwine_reduced_labelled$chlorides + data_redwine_reduced_labelled$free.sulfur.dioxide +
## data_redwine_reduced_labelled$sulphates + data_redwine_reduced_labelled$alcohol,
## family = binomial(link = "logit"))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5327 -0.4561 -0.2475 -0.1635 2.8974
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -14.365245 1.102957 -13.024
## data_redwine_reduced_labelled$citric.acid 2.698825 0.459721 5.871
## data_redwine_reduced_labelled$chlorides -12.325562 3.545171 -3.477
## data_redwine_reduced_labelled$free.sulfur.dioxide -0.019790 0.008903 -2.223
## data_redwine_reduced_labelled$sulphates 3.355766 0.488806 6.865
## data_redwine_reduced_labelled$alcohol 0.978315 0.082841 11.810
## Pr(>|z|)
## (Intercept) < 2e-16 ***
## data_redwine_reduced_labelled$citric.acid 4.34e-09 ***
## data_redwine_reduced_labelled$chlorides 0.000508 ***
## data_redwine_reduced_labelled$free.sulfur.dioxide 0.026230 *
## data_redwine_reduced_labelled$sulphates 6.64e-12 ***
## data_redwine_reduced_labelled$alcohol < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1269.92 on 1598 degrees of freedom
## Residual deviance: 918.48 on 1593 degrees of freedom
## AIC: 930.48
##
## Number of Fisher Scoring iterations: 6
As we can see free.sulfur.dioxide has the highest p-val of 0.026230, so we remove it.
red_new_lr_3 <- glm(formula = data_redwine_reduced_labelled$label_binary ~
data_redwine_reduced_labelled$citric.acid + data_redwine_reduced_labelled$chlorides
+ data_redwine_reduced_labelled$sulphates + data_redwine_reduced_labelled$alcohol,
family = binomial(link="logit"))
summary(red_new_lr_3)
##
## Call:
## glm(formula = data_redwine_reduced_labelled$label_binary ~ data_redwine_reduced_labelled$citric.acid +
## data_redwine_reduced_labelled$chlorides + data_redwine_reduced_labelled$sulphates +
## data_redwine_reduced_labelled$alcohol, family = binomial(link = "logit"))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6098 -0.4688 -0.2541 -0.1633 2.9498
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -14.72657 1.09046 -13.505 < 2e-16
## data_redwine_reduced_labelled$citric.acid 2.84374 0.45755 6.215 5.13e-10
## data_redwine_reduced_labelled$chlorides -11.75920 3.43012 -3.428 0.000608
## data_redwine_reduced_labelled$sulphates 3.19535 0.48095 6.644 3.06e-11
## data_redwine_reduced_labelled$alcohol 0.98623 0.08284 11.905 < 2e-16
##
## (Intercept) ***
## data_redwine_reduced_labelled$citric.acid ***
## data_redwine_reduced_labelled$chlorides ***
## data_redwine_reduced_labelled$sulphates ***
## data_redwine_reduced_labelled$alcohol ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1269.92 on 1598 degrees of freedom
## Residual deviance: 923.63 on 1594 degrees of freedom
## AIC: 933.63
##
## Number of Fisher Scoring iterations: 6
None of the variables have a p-value lower than 0.001, so we stop here.
Significant variables: - citric.acid - chlorides - sulphates - alcohol
Forward Selection
red_lr_new_fs1_1 <- glm(formula = data_redwine_reduced_labelled$label_binary ~ data_redwine_reduced_labelled$citric.acid,
family = binomial(link="logit"))
red_lr_new_fs1_2 <- glm(formula = data_redwine_reduced_labelled$label_binary ~ data_redwine_reduced_labelled$residual.sugar,
family = binomial(link="logit"))
red_lr_new_fs1_3 <- glm(formula = data_redwine_reduced_labelled$label_binary ~ data_redwine_reduced_labelled$chlorides,
family = binomial(link="logit"))
red_lr_new_fs1_4 <- glm(formula = data_redwine_reduced_labelled$label_binary ~ data_redwine_reduced_labelled$free.sulfur.dioxide,
family = binomial(link="logit"))
red_lr_new_fs1_5 <- glm(formula = data_redwine_reduced_labelled$label_binary ~ data_redwine_reduced_labelled$sulphates,
family = binomial(link="logit"))
red_lr_new_fs1_6 <- glm(formula = data_redwine_reduced_labelled$label_binary ~ data_redwine_reduced_labelled$alcohol,
family = binomial(link="logit"))
summary(red_lr_new_fs1_1)$coef
## Estimate Std. Error z value
## (Intercept) -2.861731 0.1571348 -18.211955
## data_redwine_reduced_labelled$citric.acid 3.209381 0.3875700 8.280776
## Pr(>|z|)
## (Intercept) 4.148644e-74
## data_redwine_reduced_labelled$citric.acid 1.223749e-16
summary(red_lr_new_fs1_2)$coef
## Estimate Std. Error z value
## (Intercept) -2.06570423 0.13762945 -15.009173
## data_redwine_reduced_labelled$residual.sugar 0.08238545 0.04375419 1.882916
## Pr(>|z|)
## (Intercept) 6.393974e-51
## data_redwine_reduced_labelled$residual.sugar 5.971181e-02
summary(red_lr_new_fs1_3)$coef
## Estimate Std. Error z value
## (Intercept) -0.390436 0.3239833 -1.205111
## data_redwine_reduced_labelled$chlorides -18.202353 4.1151963 -4.423204
## Pr(>|z|)
## (Intercept) 2.281603e-01
## data_redwine_reduced_labelled$chlorides 9.724784e-06
summary(red_lr_new_fs1_4)$coef
## Estimate Std. Error
## (Intercept) -1.51529114 0.133252373
## data_redwine_reduced_labelled$free.sulfur.dioxide -0.02236819 0.007827782
## z value Pr(>|z|)
## (Intercept) -11.371588 5.792376e-30
## data_redwine_reduced_labelled$free.sulfur.dioxide -2.857539 4.269397e-03
summary(red_lr_new_fs1_5)$coef
## Estimate Std. Error z value
## (Intercept) -3.696673 0.2790006 -13.249698
## data_redwine_reduced_labelled$sulphates 2.694416 0.3810194 7.071597
## Pr(>|z|)
## (Intercept) 4.530196e-40
## data_redwine_reduced_labelled$sulphates 1.531605e-12
summary(red_lr_new_fs1_6)$coef
## Estimate Std. Error z value
## (Intercept) -13.23466 0.84022125 -15.75140
## data_redwine_reduced_labelled$alcohol 1.05057 0.07486283 14.03326
## Pr(>|z|)
## (Intercept) 6.717456e-56
## data_redwine_reduced_labelled$alcohol 9.756085e-45
As we can see alcohol has the lowest p-val of 9.756085e-45, so we include it.
red_lr_new_fs2_1 <- glm(formula = data_redwine_reduced_labelled$label_binary ~ data_redwine_reduced_labelled$alcohol + data_redwine_reduced_labelled$citric.acid,
family = binomial(link="logit"))
red_lr_new_fs2_2 <- glm(formula = data_redwine_reduced_labelled$label_binary ~ data_redwine_reduced_labelled$alcohol + data_redwine_reduced_labelled$residual.sugar,
family = binomial(link="logit"))
red_lr_new_fs2_3 <- glm(formula = data_redwine_reduced_labelled$label_binary ~ data_redwine_reduced_labelled$alcohol + data_redwine_reduced_labelled$chlorides,
family = binomial(link="logit"))
red_lr_new_fs2_4 <- glm(formula = data_redwine_reduced_labelled$label_binary ~ data_redwine_reduced_labelled$alcohol + data_redwine_reduced_labelled$free.sulfur.dioxide,
family = binomial(link="logit"))
red_lr_new_fs2_5 <- glm(formula = data_redwine_reduced_labelled$label_binary ~ data_redwine_reduced_labelled$alcohol + data_redwine_reduced_labelled$sulphates,
family = binomial(link="logit"))
summary(red_lr_new_fs2_1)$coef
## Estimate Std. Error z value
## (Intercept) -14.052684 0.88436667 -15.890110
## data_redwine_reduced_labelled$alcohol 1.038926 0.07667443 13.549835
## data_redwine_reduced_labelled$citric.acid 2.960759 0.41927114 7.061682
## Pr(>|z|)
## (Intercept) 7.419755e-57
## data_redwine_reduced_labelled$alcohol 7.940879e-42
## data_redwine_reduced_labelled$citric.acid 1.644996e-12
summary(red_lr_new_fs2_2)$coef
## Estimate Std. Error z value
## (Intercept) -13.37146773 0.85032806 -15.725069
## data_redwine_reduced_labelled$alcohol 1.04693899 0.07485961 13.985366
## data_redwine_reduced_labelled$residual.sugar 0.06766884 0.05539129 1.221651
## Pr(>|z|)
## (Intercept) 1.018392e-55
## data_redwine_reduced_labelled$alcohol 1.914888e-44
## data_redwine_reduced_labelled$residual.sugar 2.218395e-01
summary(red_lr_new_fs2_3)$coef
## Estimate Std. Error z value
## (Intercept) -12.688354 0.94848325 -13.377521
## data_redwine_reduced_labelled$alcohol 1.027297 0.07700302 13.340997
## data_redwine_reduced_labelled$chlorides -3.607053 3.10247479 -1.162637
## Pr(>|z|)
## (Intercept) 8.183171e-41
## data_redwine_reduced_labelled$alcohol 1.336592e-40
## data_redwine_reduced_labelled$chlorides 2.449767e-01
summary(red_lr_new_fs2_4)$coef
## Estimate Std. Error
## (Intercept) -12.89767973 0.851508522
## data_redwine_reduced_labelled$alcohol 1.04337014 0.074887913
## data_redwine_reduced_labelled$free.sulfur.dioxide -0.01735816 0.008431011
## z value Pr(>|z|)
## (Intercept) -15.146859 7.947485e-52
## data_redwine_reduced_labelled$alcohol 13.932424 4.024622e-44
## data_redwine_reduced_labelled$free.sulfur.dioxide -2.058847 3.950893e-02
summary(red_lr_new_fs2_5)$coef
## Estimate Std. Error z value
## (Intercept) -15.860067 0.99419063 -15.952742
## data_redwine_reduced_labelled$alcohol 1.090337 0.07916258 13.773393
## data_redwine_reduced_labelled$sulphates 3.144493 0.42615799 7.378703
## Pr(>|z|)
## (Intercept) 2.726560e-57
## data_redwine_reduced_labelled$alcohol 3.684896e-43
## data_redwine_reduced_labelled$sulphates 1.598385e-13
As we can see sulphates has the lowest p-val of 1.598385e-13, so we include it.
red_lr_new_fs3_1 <- glm(formula = data_redwine_reduced_labelled$label_binary ~ data_redwine_reduced_labelled$alcohol + data_redwine_reduced_labelled$sulphates + data_redwine_reduced_labelled$citric.acid,
family = binomial(link="logit"))
red_lr_new_fs3_2 <- glm(formula = data_redwine_reduced_labelled$label_binary ~ data_redwine_reduced_labelled$alcohol + data_redwine_reduced_labelled$sulphates + data_redwine_reduced_labelled$residual.sugar,
family = binomial(link="logit"))
red_lr_new_fs3_3 <- glm(formula = data_redwine_reduced_labelled$label_binary ~ data_redwine_reduced_labelled$alcohol + data_redwine_reduced_labelled$sulphates + data_redwine_reduced_labelled$chlorides,
family = binomial(link="logit"))
red_lr_new_fs3_4 <- glm(formula = data_redwine_reduced_labelled$label_binary ~ data_redwine_reduced_labelled$alcohol + data_redwine_reduced_labelled$sulphates + data_redwine_reduced_labelled$free.sulfur.dioxide,
family = binomial(link="logit"))
summary(red_lr_new_fs3_1)$coef
## Estimate Std. Error z value
## (Intercept) -16.068750 1.01590661 -15.817153
## data_redwine_reduced_labelled$alcohol 1.077530 0.07994843 13.477808
## data_redwine_reduced_labelled$sulphates 2.529062 0.44061020 5.739908
## data_redwine_reduced_labelled$citric.acid 2.399661 0.43650497 5.497443
## Pr(>|z|)
## (Intercept) 2.369689e-56
## data_redwine_reduced_labelled$alcohol 2.112917e-41
## data_redwine_reduced_labelled$sulphates 9.472809e-09
## data_redwine_reduced_labelled$citric.acid 3.853376e-08
summary(red_lr_new_fs3_2)$coef
## Estimate Std. Error z value
## (Intercept) -16.0524804 1.00843682 -15.918182
## data_redwine_reduced_labelled$alcohol 1.0869830 0.07915845 13.731736
## data_redwine_reduced_labelled$sulphates 3.1678986 0.42774908 7.405974
## data_redwine_reduced_labelled$residual.sugar 0.0813842 0.05693833 1.429340
## Pr(>|z|)
## (Intercept) 4.739523e-57
## data_redwine_reduced_labelled$alcohol 6.554383e-43
## data_redwine_reduced_labelled$sulphates 1.301911e-13
## data_redwine_reduced_labelled$residual.sugar 1.529067e-01
summary(red_lr_new_fs3_3)$coef
## Estimate Std. Error z value
## (Intercept) -14.898718 1.05392887 -14.136360
## data_redwine_reduced_labelled$alcohol 1.025799 0.08165339 12.562841
## data_redwine_reduced_labelled$sulphates 3.720786 0.48047416 7.743988
## data_redwine_reduced_labelled$chlorides -7.980316 2.87455650 -2.776190
## Pr(>|z|)
## (Intercept) 2.267107e-45
## data_redwine_reduced_labelled$alcohol 3.379746e-36
## data_redwine_reduced_labelled$sulphates 9.634592e-15
## data_redwine_reduced_labelled$chlorides 5.500001e-03
summary(red_lr_new_fs3_4)$coef
## Estimate Std. Error
## (Intercept) -15.5279190 1.000274930
## data_redwine_reduced_labelled$alcohol 1.0836960 0.079280347
## data_redwine_reduced_labelled$sulphates 3.2831240 0.430336294
## data_redwine_reduced_labelled$free.sulfur.dioxide -0.0240922 0.008787798
## z value Pr(>|z|)
## (Intercept) -15.523651 2.399979e-54
## data_redwine_reduced_labelled$alcohol 13.669163 1.551701e-42
## data_redwine_reduced_labelled$sulphates 7.629205 2.362049e-14
## data_redwine_reduced_labelled$free.sulfur.dioxide -2.741552 6.114973e-03
As we can see citric.acid has the lowest p-val of 3.853376e-08, so we include it.
red_lr_new_fs4_1 <- glm(formula = data_redwine_reduced_labelled$label_binary ~ data_redwine_reduced_labelled$alcohol + data_redwine_reduced_labelled$sulphates + data_redwine_reduced_labelled$citric.acid + data_redwine_reduced_labelled$residual.sugar,
family = binomial(link="logit"))
red_lr_new_fs4_2 <- glm(formula = data_redwine_reduced_labelled$label_binary ~ data_redwine_reduced_labelled$alcohol + data_redwine_reduced_labelled$sulphates + data_redwine_reduced_labelled$citric.acid + data_redwine_reduced_labelled$chlorides,
family = binomial(link="logit"))
red_lr_new_fs4_3 <- glm(formula = data_redwine_reduced_labelled$label_binary ~ data_redwine_reduced_labelled$alcohol + data_redwine_reduced_labelled$sulphates + data_redwine_reduced_labelled$citric.acid + data_redwine_reduced_labelled$free.sulfur.dioxide,
family = binomial(link="logit"))
summary(red_lr_new_fs4_1)$coef
## Estimate Std. Error
## (Intercept) -16.09245495 1.02538769
## data_redwine_reduced_labelled$alcohol 1.07703941 0.07999495
## data_redwine_reduced_labelled$sulphates 2.53693051 0.44303437
## data_redwine_reduced_labelled$citric.acid 2.38288714 0.44670246
## data_redwine_reduced_labelled$residual.sugar 0.01127305 0.06381395
## z value Pr(>|z|)
## (Intercept) -15.6940200 1.661893e-55
## data_redwine_reduced_labelled$alcohol 13.4638421 2.552887e-41
## data_redwine_reduced_labelled$sulphates 5.7262611 1.026681e-08
## data_redwine_reduced_labelled$citric.acid 5.3343944 9.586394e-08
## data_redwine_reduced_labelled$residual.sugar 0.1766549 8.597794e-01
summary(red_lr_new_fs4_2)$coef
## Estimate Std. Error z value
## (Intercept) -14.7265723 1.0904622 -13.504890
## data_redwine_reduced_labelled$alcohol 0.9862251 0.0828439 11.904619
## data_redwine_reduced_labelled$sulphates 3.1953460 0.4809525 6.643788
## data_redwine_reduced_labelled$citric.acid 2.8437433 0.4575547 6.215089
## data_redwine_reduced_labelled$chlorides -11.7592032 3.4301190 -3.428220
## Pr(>|z|)
## (Intercept) 1.463328e-41
## data_redwine_reduced_labelled$alcohol 1.119740e-32
## data_redwine_reduced_labelled$sulphates 3.057230e-11
## data_redwine_reduced_labelled$citric.acid 5.129550e-10
## data_redwine_reduced_labelled$chlorides 6.075525e-04
summary(red_lr_new_fs4_3)$coef
## Estimate Std. Error
## (Intercept) -15.80134270 1.022737890
## data_redwine_reduced_labelled$alcohol 1.07247216 0.080038765
## data_redwine_reduced_labelled$sulphates 2.65738674 0.445126973
## data_redwine_reduced_labelled$citric.acid 2.26280739 0.439923778
## data_redwine_reduced_labelled$free.sulfur.dioxide -0.01728409 0.008805842
## z value Pr(>|z|)
## (Intercept) -15.450041 7.539506e-54
## data_redwine_reduced_labelled$alcohol 13.399409 6.094635e-41
## data_redwine_reduced_labelled$sulphates 5.969952 2.373232e-09
## data_redwine_reduced_labelled$citric.acid 5.143635 2.694727e-07
## data_redwine_reduced_labelled$free.sulfur.dioxide -1.962798 4.966965e-02
As we can see chlorides has the lowest p-val of 6.075525e-04, so we include it.
red_lr_new_fs5_1 <- glm(formula = data_redwine_reduced_labelled$label_binary ~ data_redwine_reduced_labelled$alcohol + data_redwine_reduced_labelled$sulphates + data_redwine_reduced_labelled$citric.acid + data_redwine_reduced_labelled$chlorides + data_redwine_reduced_labelled$residual.sugar,
family = binomial(link="logit"))
red_lr_new_fs5_2 <- glm(formula = data_redwine_reduced_labelled$label_binary ~ data_redwine_reduced_labelled$alcohol + data_redwine_reduced_labelled$sulphates + data_redwine_reduced_labelled$citric.acid + data_redwine_reduced_labelled$chlorides + data_redwine_reduced_labelled$free.sulfur.dioxide,
family = binomial(link="logit"))
summary(red_lr_new_fs5_1)$coef
## Estimate Std. Error
## (Intercept) -14.76166802 1.09708151
## data_redwine_reduced_labelled$alcohol 0.98473937 0.08298161
## data_redwine_reduced_labelled$sulphates 3.20985154 0.48285885
## data_redwine_reduced_labelled$citric.acid 2.81023432 0.46696219
## data_redwine_reduced_labelled$chlorides -11.84557305 3.45540631
## data_redwine_reduced_labelled$residual.sugar 0.02293995 0.06405785
## z value Pr(>|z|)
## (Intercept) -13.4553977 2.861951e-41
## data_redwine_reduced_labelled$alcohol 11.8669585 1.757412e-32
## data_redwine_reduced_labelled$sulphates 6.6475981 2.979145e-11
## data_redwine_reduced_labelled$citric.acid 6.0181196 1.764548e-09
## data_redwine_reduced_labelled$chlorides -3.4281274 6.077602e-04
## data_redwine_reduced_labelled$residual.sugar 0.3581131 7.202587e-01
summary(red_lr_new_fs5_2)$coef
## Estimate Std. Error
## (Intercept) -14.36524548 1.102957148
## data_redwine_reduced_labelled$alcohol 0.97831473 0.082841248
## data_redwine_reduced_labelled$sulphates 3.35576569 0.488806341
## data_redwine_reduced_labelled$citric.acid 2.69882466 0.459720863
## data_redwine_reduced_labelled$chlorides -12.32556183 3.545171137
## data_redwine_reduced_labelled$free.sulfur.dioxide -0.01979038 0.008903398
## z value Pr(>|z|)
## (Intercept) -13.024301 8.901291e-39
## data_redwine_reduced_labelled$alcohol 11.809512 3.485780e-32
## data_redwine_reduced_labelled$sulphates 6.865225 6.638636e-12
## data_redwine_reduced_labelled$citric.acid 5.870573 4.342928e-09
## data_redwine_reduced_labelled$chlorides -3.476718 5.075906e-04
## data_redwine_reduced_labelled$free.sulfur.dioxide -2.222790 2.622996e-02
As we can see, none of the variables have a p-value lower than 0.001, so we stop here.
Significant variables: - citric.acid - chlorides - sulphates - alcohol
White Wine
white_lr_new_1 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~
data_whitewine_reduced_labelled$fixed.acidity + data_whitewine_reduced_labelled$volatile.acidity +
data_whitewine_reduced_labelled$citric.acid + data_whitewine_reduced_labelled$residual.sugar + data_whitewine_reduced_labelled$chlorides +
data_whitewine_reduced_labelled$free.sulfur.dioxide + data_whitewine_reduced_labelled$sulphates,
family = binomial(link="logit"))
summary(white_lr_new_1)
##
## Call:
## glm(formula = data_whitewine_reduced_labelled$label_binary ~
## data_whitewine_reduced_labelled$fixed.acidity + data_whitewine_reduced_labelled$volatile.acidity +
## data_whitewine_reduced_labelled$citric.acid + data_whitewine_reduced_labelled$residual.sugar +
## data_whitewine_reduced_labelled$chlorides + data_whitewine_reduced_labelled$free.sulfur.dioxide +
## data_whitewine_reduced_labelled$sulphates, family = binomial(link = "logit"))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5190 -0.7364 -0.5650 -0.2283 3.5933
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 2.252605 0.385275
## data_whitewine_reduced_labelled$fixed.acidity -0.157333 0.045381
## data_whitewine_reduced_labelled$volatile.acidity -1.738417 0.402349
## data_whitewine_reduced_labelled$citric.acid -0.291321 0.346914
## data_whitewine_reduced_labelled$residual.sugar -0.032133 0.008523
## data_whitewine_reduced_labelled$chlorides -57.497984 3.972277
## data_whitewine_reduced_labelled$free.sulfur.dioxide 0.003737 0.002249
## data_whitewine_reduced_labelled$sulphates 1.130551 0.303713
## z value Pr(>|z|)
## (Intercept) 5.847 5.01e-09 ***
## data_whitewine_reduced_labelled$fixed.acidity -3.467 0.000526 ***
## data_whitewine_reduced_labelled$volatile.acidity -4.321 1.56e-05 ***
## data_whitewine_reduced_labelled$citric.acid -0.840 0.401049
## data_whitewine_reduced_labelled$residual.sugar -3.770 0.000163 ***
## data_whitewine_reduced_labelled$chlorides -14.475 < 2e-16 ***
## data_whitewine_reduced_labelled$free.sulfur.dioxide 1.661 0.096634 .
## data_whitewine_reduced_labelled$sulphates 3.722 0.000197 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5116.8 on 4897 degrees of freedom
## Residual deviance: 4695.8 on 4890 degrees of freedom
## AIC: 4711.8
##
## Number of Fisher Scoring iterations: 5
As we can see citric.acid has the highest p-val of 0.401049, so we remove it.
white_lr_new_2 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~
data_whitewine_reduced_labelled$fixed.acidity + data_whitewine_reduced_labelled$volatile.acidity +
data_whitewine_reduced_labelled$residual.sugar + data_whitewine_reduced_labelled$chlorides +
data_whitewine_reduced_labelled$free.sulfur.dioxide + data_whitewine_reduced_labelled$sulphates,
family = binomial(link="logit"))
summary(white_lr_new_2)
##
## Call:
## glm(formula = data_whitewine_reduced_labelled$label_binary ~
## data_whitewine_reduced_labelled$fixed.acidity + data_whitewine_reduced_labelled$volatile.acidity +
## data_whitewine_reduced_labelled$residual.sugar + data_whitewine_reduced_labelled$chlorides +
## data_whitewine_reduced_labelled$free.sulfur.dioxide +
## data_whitewine_reduced_labelled$sulphates, family = binomial(link = "logit"))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5358 -0.7361 -0.5659 -0.2310 3.5685
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 2.230804 0.384252
## data_whitewine_reduced_labelled$fixed.acidity -0.168725 0.043345
## data_whitewine_reduced_labelled$volatile.acidity -1.679593 0.396111
## data_whitewine_reduced_labelled$residual.sugar -0.032726 0.008499
## data_whitewine_reduced_labelled$chlorides -57.401688 3.969023
## data_whitewine_reduced_labelled$free.sulfur.dioxide 0.003639 0.002248
## data_whitewine_reduced_labelled$sulphates 1.111044 0.302808
## z value Pr(>|z|)
## (Intercept) 5.806 6.41e-09 ***
## data_whitewine_reduced_labelled$fixed.acidity -3.893 9.92e-05 ***
## data_whitewine_reduced_labelled$volatile.acidity -4.240 2.23e-05 ***
## data_whitewine_reduced_labelled$residual.sugar -3.850 0.000118 ***
## data_whitewine_reduced_labelled$chlorides -14.462 < 2e-16 ***
## data_whitewine_reduced_labelled$free.sulfur.dioxide 1.619 0.105479
## data_whitewine_reduced_labelled$sulphates 3.669 0.000243 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5116.8 on 4897 degrees of freedom
## Residual deviance: 4696.5 on 4891 degrees of freedom
## AIC: 4710.5
##
## Number of Fisher Scoring iterations: 5
As we can see free.sulfur.dioxide has the highest p-val of 0.105479, so we remove it.
white_lr_new_3 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~
data_whitewine_reduced_labelled$fixed.acidity + data_whitewine_reduced_labelled$volatile.acidity +
data_whitewine_reduced_labelled$residual.sugar + data_whitewine_reduced_labelled$chlorides + data_whitewine_reduced_labelled$sulphates,
family = binomial(link="logit"))
summary(white_lr_new_3)
##
## Call:
## glm(formula = data_whitewine_reduced_labelled$label_binary ~
## data_whitewine_reduced_labelled$fixed.acidity + data_whitewine_reduced_labelled$volatile.acidity +
## data_whitewine_reduced_labelled$residual.sugar + data_whitewine_reduced_labelled$chlorides +
## data_whitewine_reduced_labelled$sulphates, family = binomial(link = "logit"))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5533 -0.7373 -0.5638 -0.2288 3.5813
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 2.349256 0.376844 6.234
## data_whitewine_reduced_labelled$fixed.acidity -0.174425 0.043138 -4.043
## data_whitewine_reduced_labelled$volatile.acidity -1.749178 0.393433 -4.446
## data_whitewine_reduced_labelled$residual.sugar -0.028936 0.008142 -3.554
## data_whitewine_reduced_labelled$chlorides -56.711066 3.929915 -14.431
## data_whitewine_reduced_labelled$sulphates 1.143687 0.302058 3.786
## Pr(>|z|)
## (Intercept) 4.55e-10 ***
## data_whitewine_reduced_labelled$fixed.acidity 5.27e-05 ***
## data_whitewine_reduced_labelled$volatile.acidity 8.75e-06 ***
## data_whitewine_reduced_labelled$residual.sugar 0.000379 ***
## data_whitewine_reduced_labelled$chlorides < 2e-16 ***
## data_whitewine_reduced_labelled$sulphates 0.000153 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5116.8 on 4897 degrees of freedom
## Residual deviance: 4699.1 on 4892 degrees of freedom
## AIC: 4711.1
##
## Number of Fisher Scoring iterations: 5
None of the variables have a p-value lower than 0.001, so we stop here.
Significant variables: - fixed.acidity - volatile.acidity - residual.sugar - chlorides - sulphates
Forward Selection
white_lr_new_fs1_1 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$fixed.acidity,
family = binomial(link="logit"))
white_lr_new_fs1_2 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$volatile.acidity,
family = binomial(link="logit"))
white_lr_new_fs1_3 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$citric.acid,
family = binomial(link="logit"))
white_lr_new_fs1_4 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$residual.sugar,
family = binomial(link="logit"))
white_lr_new_fs1_5 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$chlorides,
family = binomial(link="logit"))
white_lr_new_fs1_6 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$free.sulfur.dioxide,
family = binomial(link="logit"))
white_lr_new_fs1_7 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$sulphates,
family = binomial(link="logit"))
summary(white_lr_new_fs1_1)$coef
## Estimate Std. Error z value
## (Intercept) 0.3649685 0.2930607 1.245368
## data_whitewine_reduced_labelled$fixed.acidity -0.2426636 0.0430520 -5.636524
## Pr(>|z|)
## (Intercept) 2.129964e-01
## data_whitewine_reduced_labelled$fixed.acidity 1.735170e-08
summary(white_lr_new_fs1_2)$coef
## Estimate Std. Error
## (Intercept) -0.8050328 0.1066823
## data_whitewine_reduced_labelled$volatile.acidity -1.7628065 0.3758387
## z value Pr(>|z|)
## (Intercept) -7.546078 4.485630e-14
## data_whitewine_reduced_labelled$volatile.acidity -4.690327 2.727689e-06
summary(white_lr_new_fs1_3)$coef
## Estimate Std. Error z value
## (Intercept) -1.0438859 0.1033727 -10.09828
## data_whitewine_reduced_labelled$citric.acid -0.7331738 0.2967174 -2.47095
## Pr(>|z|)
## (Intercept) 5.622114e-24
## data_whitewine_reduced_labelled$citric.acid 1.347548e-02
summary(white_lr_new_fs1_4)$coef
## Estimate Std. Error
## (Intercept) -0.91701409 0.054677615
## data_whitewine_reduced_labelled$residual.sugar -0.06215676 0.007621927
## z value Pr(>|z|)
## (Intercept) -16.771289 3.958262e-63
## data_whitewine_reduced_labelled$residual.sugar -8.154993 3.492009e-16
summary(white_lr_new_fs1_5)$coef
## Estimate Std. Error z value
## (Intercept) 1.185817 0.1518908 7.807034
## data_whitewine_reduced_labelled$chlorides -59.372790 3.7355361 -15.894048
## Pr(>|z|)
## (Intercept) 5.854968e-15
## data_whitewine_reduced_labelled$chlorides 6.967899e-57
summary(white_lr_new_fs1_6)$coef
## Estimate Std. Error
## (Intercept) -1.166781659 0.080464419
## data_whitewine_reduced_labelled$free.sulfur.dioxide -0.003422994 0.002089073
## z value Pr(>|z|)
## (Intercept) -14.500591 1.201107e-47
## data_whitewine_reduced_labelled$free.sulfur.dioxide -1.638523 1.013126e-01
summary(white_lr_new_fs1_7)$coef
## Estimate Std. Error z value
## (Intercept) -1.770770 0.1512616 -11.706678
## data_whitewine_reduced_labelled$sulphates 0.981025 0.2962019 3.312014
## Pr(>|z|)
## (Intercept) 1.178061e-31
## data_whitewine_reduced_labelled$sulphates 9.262688e-04
As we can see chlorides has the lowest p-val of 6.967899e-57, so we include it.
white_lr_new_fs2_1 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$chlorides + data_whitewine_reduced_labelled$fixed.acidity,
family = binomial(link="logit"))
white_lr_new_fs2_2 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$chlorides + data_whitewine_reduced_labelled$volatile.acidity,
family = binomial(link="logit"))
white_lr_new_fs2_3 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$chlorides + data_whitewine_reduced_labelled$citric.acid,
family = binomial(link="logit"))
white_lr_new_fs2_4 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$chlorides + data_whitewine_reduced_labelled$residual.sugar,
family = binomial(link="logit"))
white_lr_new_fs2_5 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$chlorides + data_whitewine_reduced_labelled$free.sulfur.dioxide,
family = binomial(link="logit"))
white_lr_new_fs2_6 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$chlorides + data_whitewine_reduced_labelled$sulphates,
family = binomial(link="logit"))
summary(white_lr_new_fs2_1)$coef
## Estimate Std. Error z value
## (Intercept) 2.3815023 0.31744117 7.502185
## data_whitewine_reduced_labelled$chlorides -58.0734708 3.74380252 -15.511895
## data_whitewine_reduced_labelled$fixed.acidity -0.1833189 0.04274359 -4.288804
## Pr(>|z|)
## (Intercept) 6.276283e-14
## data_whitewine_reduced_labelled$chlorides 2.882456e-54
## data_whitewine_reduced_labelled$fixed.acidity 1.796381e-05
summary(white_lr_new_fs2_2)$coef
## Estimate Std. Error
## (Intercept) 1.745379 0.1926771
## data_whitewine_reduced_labelled$chlorides -60.486196 3.7829969
## data_whitewine_reduced_labelled$volatile.acidity -1.891691 0.3925510
## z value Pr(>|z|)
## (Intercept) 9.058570 1.321721e-19
## data_whitewine_reduced_labelled$chlorides -15.988963 1.525498e-57
## data_whitewine_reduced_labelled$volatile.acidity -4.818967 1.443031e-06
summary(white_lr_new_fs2_3)$coef
## Estimate Std. Error z value
## (Intercept) 1.3065633 0.1800580 7.256346
## data_whitewine_reduced_labelled$chlorides -59.1752674 3.7360278 -15.839086
## data_whitewine_reduced_labelled$citric.acid -0.3901545 0.3119171 -1.250827
## Pr(>|z|)
## (Intercept) 3.976857e-13
## data_whitewine_reduced_labelled$chlorides 1.672332e-56
## data_whitewine_reduced_labelled$citric.acid 2.109974e-01
summary(white_lr_new_fs2_4)$coef
## Estimate Std. Error
## (Intercept) 1.22177633 0.151417259
## data_whitewine_reduced_labelled$chlorides -54.94345325 3.813705782
## data_whitewine_reduced_labelled$residual.sugar -0.03649588 0.008070705
## z value Pr(>|z|)
## (Intercept) 8.068937 7.091283e-16
## data_whitewine_reduced_labelled$chlorides -14.406841 4.686604e-47
## data_whitewine_reduced_labelled$residual.sugar -4.522019 6.125268e-06
summary(white_lr_new_fs2_5)$coef
## Estimate Std. Error
## (Intercept) 1.120614186 0.158890039
## data_whitewine_reduced_labelled$chlorides -60.404743716 3.822414388
## data_whitewine_reduced_labelled$free.sulfur.dioxide 0.003058127 0.002140933
## z value Pr(>|z|)
## (Intercept) 7.052766 1.753963e-12
## data_whitewine_reduced_labelled$chlorides -15.802772 2.977307e-56
## data_whitewine_reduced_labelled$free.sulfur.dioxide 1.428408 1.531744e-01
summary(white_lr_new_fs2_6)$coef
## Estimate Std. Error z value
## (Intercept) 0.5849592 0.2023250 2.891185
## data_whitewine_reduced_labelled$chlorides -60.9418598 3.7821418 -16.113055
## data_whitewine_reduced_labelled$sulphates 1.3469785 0.2982418 4.516397
## Pr(>|z|)
## (Intercept) 3.837916e-03
## data_whitewine_reduced_labelled$chlorides 2.065612e-58
## data_whitewine_reduced_labelled$sulphates 6.290072e-06
As we can see volatile.acidity has the lowest p-val of 1.443031e-06, so we include it.
white_lr_new_fs3_1 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$chlorides + data_whitewine_reduced_labelled$volatile.acidity + data_whitewine_reduced_labelled$fixed.acidity,
family = binomial(link="logit"))
white_lr_new_fs3_2 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$chlorides + data_whitewine_reduced_labelled$volatile.acidity + data_whitewine_reduced_labelled$citric.acid,
family = binomial(link="logit"))
white_lr_new_fs3_3 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$chlorides + data_whitewine_reduced_labelled$volatile.acidity + data_whitewine_reduced_labelled$residual.sugar,
family = binomial(link="logit"))
white_lr_new_fs3_4 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$chlorides + data_whitewine_reduced_labelled$volatile.acidity + data_whitewine_reduced_labelled$free.sulfur.dioxide,
family = binomial(link="logit"))
white_lr_new_fs3_5 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$chlorides +data_whitewine_reduced_labelled$volatile.acidity + data_whitewine_reduced_labelled$sulphates,
family = binomial(link="logit"))
summary(white_lr_new_fs3_1)$coef
## Estimate Std. Error
## (Intercept) 3.0283975 0.34547184
## data_whitewine_reduced_labelled$chlorides -59.1278125 3.79042048
## data_whitewine_reduced_labelled$volatile.acidity -1.9680226 0.39427314
## data_whitewine_reduced_labelled$fixed.acidity -0.1934706 0.04312213
## z value Pr(>|z|)
## (Intercept) 8.765975 1.851672e-18
## data_whitewine_reduced_labelled$chlorides -15.599275 7.362516e-55
## data_whitewine_reduced_labelled$volatile.acidity -4.991521 5.990574e-07
## data_whitewine_reduced_labelled$fixed.acidity -4.486574 7.237755e-06
summary(white_lr_new_fs3_2)$coef
## Estimate Std. Error
## (Intercept) 2.003973 0.2287624
## data_whitewine_reduced_labelled$chlorides -60.239455 3.7828839
## data_whitewine_reduced_labelled$volatile.acidity -2.033955 0.3985429
## data_whitewine_reduced_labelled$citric.acid -0.697307 0.3285760
## z value Pr(>|z|)
## (Intercept) 8.760061 1.951444e-18
## data_whitewine_reduced_labelled$chlorides -15.924215 4.303838e-57
## data_whitewine_reduced_labelled$volatile.acidity -5.103478 3.334675e-07
## data_whitewine_reduced_labelled$citric.acid -2.122209 3.382018e-02
summary(white_lr_new_fs3_3)$coef
## Estimate Std. Error
## (Intercept) 1.72872675 0.190930451
## data_whitewine_reduced_labelled$chlorides -56.21774755 3.873438973
## data_whitewine_reduced_labelled$volatile.acidity -1.74084401 0.390801533
## data_whitewine_reduced_labelled$residual.sugar -0.03331829 0.008085571
## z value Pr(>|z|)
## (Intercept) 9.054222 1.375444e-19
## data_whitewine_reduced_labelled$chlorides -14.513653 9.928957e-48
## data_whitewine_reduced_labelled$volatile.acidity -4.454548 8.407030e-06
## data_whitewine_reduced_labelled$residual.sugar -4.120709 3.777079e-05
summary(white_lr_new_fs3_4)$coef
## Estimate Std. Error
## (Intercept) 1.689819042 0.200428074
## data_whitewine_reduced_labelled$chlorides -61.216631392 3.861942664
## data_whitewine_reduced_labelled$volatile.acidity -1.862613393 0.394028296
## data_whitewine_reduced_labelled$free.sulfur.dioxide 0.002205921 0.002165851
## z value Pr(>|z|)
## (Intercept) 8.431050 3.425783e-17
## data_whitewine_reduced_labelled$chlorides -15.851253 1.378049e-56
## data_whitewine_reduced_labelled$volatile.acidity -4.727106 2.277428e-06
## data_whitewine_reduced_labelled$free.sulfur.dioxide 1.018501 3.084400e-01
summary(white_lr_new_fs3_5)$coef
## Estimate Std. Error
## (Intercept) 1.151284 0.2381471
## data_whitewine_reduced_labelled$chlorides -61.883479 3.8241263
## data_whitewine_reduced_labelled$volatile.acidity -1.798756 0.3924316
## data_whitewine_reduced_labelled$sulphates 1.268275 0.2993933
## z value Pr(>|z|)
## (Intercept) 4.834342 1.335871e-06
## data_whitewine_reduced_labelled$chlorides -16.182384 6.714358e-59
## data_whitewine_reduced_labelled$volatile.acidity -4.583616 4.570038e-06
## data_whitewine_reduced_labelled$sulphates 4.236151 2.273842e-05
As we can see fixed.acidity has the lowest p-val of 7.237755e-06, so we include it.
white_lr_new_fs4_1 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$chlorides + data_whitewine_reduced_labelled$volatile.acidity + data_whitewine_reduced_labelled$fixed.acidity + data_whitewine_reduced_labelled$citric.acid,
family = binomial(link="logit"))
white_lr_new_fs4_2 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$chlorides + data_whitewine_reduced_labelled$volatile.acidity + data_whitewine_reduced_labelled$fixed.acidity + data_whitewine_reduced_labelled$residual.sugar,
family = binomial(link="logit"))
white_lr_new_fs4_3 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$chlorides + data_whitewine_reduced_labelled$volatile.acidity + data_whitewine_reduced_labelled$fixed.acidity + data_whitewine_reduced_labelled$free.sulfur.dioxide,
family = binomial(link="logit"))
white_lr_new_fs4_4 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$chlorides +data_whitewine_reduced_labelled$volatile.acidity + data_whitewine_reduced_labelled$fixed.acidity + data_whitewine_reduced_labelled$sulphates,
family = binomial(link="logit"))
summary(white_lr_new_fs4_1)$coef
## Estimate Std. Error
## (Intercept) 3.0580736 0.34759644
## data_whitewine_reduced_labelled$chlorides -59.1064153 3.79021180
## data_whitewine_reduced_labelled$volatile.acidity -2.0211551 0.39979627
## data_whitewine_reduced_labelled$fixed.acidity -0.1825649 0.04515048
## data_whitewine_reduced_labelled$citric.acid -0.2741183 0.34004038
## z value Pr(>|z|)
## (Intercept) 8.7977703 1.395613e-18
## data_whitewine_reduced_labelled$chlorides -15.5944888 7.935613e-55
## data_whitewine_reduced_labelled$volatile.acidity -5.0554626 4.293487e-07
## data_whitewine_reduced_labelled$fixed.acidity -4.0434770 5.266433e-05
## data_whitewine_reduced_labelled$citric.acid -0.8061345 4.201653e-01
summary(white_lr_new_fs4_2)$coef
## Estimate Std. Error
## (Intercept) 2.92982972 0.344471902
## data_whitewine_reduced_labelled$chlorides -55.24494435 3.878659609
## data_whitewine_reduced_labelled$volatile.acidity -1.83154670 0.393007186
## data_whitewine_reduced_labelled$fixed.acidity -0.18060024 0.042983412
## data_whitewine_reduced_labelled$residual.sugar -0.03096465 0.008122731
## z value Pr(>|z|)
## (Intercept) 8.505279 1.811590e-17
## data_whitewine_reduced_labelled$chlorides -14.243308 4.933512e-46
## data_whitewine_reduced_labelled$volatile.acidity -4.660339 3.156891e-06
## data_whitewine_reduced_labelled$fixed.acidity -4.201627 2.650041e-05
## data_whitewine_reduced_labelled$residual.sugar -3.812099 1.377919e-04
summary(white_lr_new_fs4_3)$coef
## Estimate Std. Error
## (Intercept) 2.975890959 0.352830537
## data_whitewine_reduced_labelled$chlorides -59.672313143 3.869527423
## data_whitewine_reduced_labelled$volatile.acidity -1.945079665 0.395779238
## data_whitewine_reduced_labelled$fixed.acidity -0.191755975 0.043224028
## data_whitewine_reduced_labelled$free.sulfur.dioxide 0.001617288 0.002179921
## z value Pr(>|z|)
## (Intercept) 8.4343350 3.330913e-17
## data_whitewine_reduced_labelled$chlorides -15.4210855 1.181025e-53
## data_whitewine_reduced_labelled$volatile.acidity -4.9145571 8.898339e-07
## data_whitewine_reduced_labelled$fixed.acidity -4.4363282 9.150630e-06
## data_whitewine_reduced_labelled$free.sulfur.dioxide 0.7419023 4.581466e-01
summary(white_lr_new_fs4_4)$coef
## Estimate Std. Error
## (Intercept) 2.4114505 0.37722072
## data_whitewine_reduced_labelled$chlorides -60.4509733 3.83246640
## data_whitewine_reduced_labelled$volatile.acidity -1.8789800 0.39436718
## data_whitewine_reduced_labelled$fixed.acidity -0.1866641 0.04325869
## data_whitewine_reduced_labelled$sulphates 1.2171385 0.30100971
## z value Pr(>|z|)
## (Intercept) 6.392678 1.630056e-10
## data_whitewine_reduced_labelled$chlorides -15.773386 4.743669e-56
## data_whitewine_reduced_labelled$volatile.acidity -4.764544 1.892807e-06
## data_whitewine_reduced_labelled$fixed.acidity -4.315065 1.595558e-05
## data_whitewine_reduced_labelled$sulphates 4.043519 5.265491e-05
As we can see sulphates has the lowest p-val of 5.265491e-05, so we include it.
white_lr_new_fs5_1 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$chlorides + data_whitewine_reduced_labelled$volatile.acidity + data_whitewine_reduced_labelled$fixed.acidity + data_whitewine_reduced_labelled$sulphates + data_whitewine_reduced_labelled$citric.acid,
family = binomial(link="logit"))
white_lr_new_fs5_2 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$chlorides + data_whitewine_reduced_labelled$volatile.acidity + data_whitewine_reduced_labelled$fixed.acidity + data_whitewine_reduced_labelled$sulphates + data_whitewine_reduced_labelled$residual.sugar,
family = binomial(link="logit"))
white_lr_new_fs5_3 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$chlorides + data_whitewine_reduced_labelled$volatile.acidity + data_whitewine_reduced_labelled$fixed.acidity + data_whitewine_reduced_labelled$sulphates + data_whitewine_reduced_labelled$free.sulfur.dioxide,
family = binomial(link="logit"))
summary(white_lr_new_fs5_1)$coef
## Estimate Std. Error
## (Intercept) 2.4394796 0.37831714
## data_whitewine_reduced_labelled$chlorides -60.4403506 3.83202729
## data_whitewine_reduced_labelled$volatile.acidity -1.9512694 0.39986934
## data_whitewine_reduced_labelled$fixed.acidity -0.1714310 0.04535132
## data_whitewine_reduced_labelled$sulphates 1.2422769 0.30189875
## data_whitewine_reduced_labelled$citric.acid -0.3788683 0.34349231
## z value Pr(>|z|)
## (Intercept) 6.448240 1.131566e-10
## data_whitewine_reduced_labelled$chlorides -15.772422 4.816687e-56
## data_whitewine_reduced_labelled$volatile.acidity -4.879768 1.062109e-06
## data_whitewine_reduced_labelled$fixed.acidity -3.780067 1.567864e-04
## data_whitewine_reduced_labelled$sulphates 4.114879 3.873822e-05
## data_whitewine_reduced_labelled$citric.acid -1.102989 2.700318e-01
summary(white_lr_new_fs5_2)$coef
## Estimate Std. Error
## (Intercept) 2.34925637 0.376843516
## data_whitewine_reduced_labelled$chlorides -56.71106621 3.929914633
## data_whitewine_reduced_labelled$volatile.acidity -1.74917784 0.393433164
## data_whitewine_reduced_labelled$fixed.acidity -0.17442458 0.043137984
## data_whitewine_reduced_labelled$sulphates 1.14368693 0.302057943
## data_whitewine_reduced_labelled$residual.sugar -0.02893561 0.008141624
## z value Pr(>|z|)
## (Intercept) 6.234037 4.545660e-10
## data_whitewine_reduced_labelled$chlorides -14.430610 3.321296e-47
## data_whitewine_reduced_labelled$volatile.acidity -4.445934 8.751086e-06
## data_whitewine_reduced_labelled$fixed.acidity -4.043411 5.267927e-05
## data_whitewine_reduced_labelled$sulphates 3.786316 1.528970e-04
## data_whitewine_reduced_labelled$residual.sugar -3.554034 3.793700e-04
summary(white_lr_new_fs5_3)$coef
## Estimate Std. Error
## (Intercept) 2.376808817 0.38258366
## data_whitewine_reduced_labelled$chlorides -60.851311007 3.90763703
## data_whitewine_reduced_labelled$volatile.acidity -1.862268583 0.39575520
## data_whitewine_reduced_labelled$fixed.acidity -0.185452147 0.04334386
## data_whitewine_reduced_labelled$sulphates 1.209149771 0.30136939
## data_whitewine_reduced_labelled$free.sulfur.dioxide 0.001198594 0.00218560
## z value Pr(>|z|)
## (Intercept) 6.212521 5.214137e-10
## data_whitewine_reduced_labelled$chlorides -15.572406 1.121093e-54
## data_whitewine_reduced_labelled$volatile.acidity -4.705607 2.531113e-06
## data_whitewine_reduced_labelled$fixed.acidity -4.278626 1.880506e-05
## data_whitewine_reduced_labelled$sulphates 4.012185 6.015929e-05
## data_whitewine_reduced_labelled$free.sulfur.dioxide 0.548405 5.834138e-01
As we can see residual.sugar has the lowest p-val of 3.793700e-04, so we include it.
white_lr_new_fs6_1 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$chlorides + data_whitewine_reduced_labelled$volatile.acidity + data_whitewine_reduced_labelled$fixed.acidity + data_whitewine_reduced_labelled$sulphates + data_whitewine_reduced_labelled$residual.sugar + data_whitewine_reduced_labelled$citric.acid,
family = binomial(link="logit"))
white_lr_new_fs6_2 <- glm(formula = data_whitewine_reduced_labelled$label_binary ~ data_whitewine_reduced_labelled$chlorides + data_whitewine_reduced_labelled$volatile.acidity + data_whitewine_reduced_labelled$fixed.acidity + data_whitewine_reduced_labelled$sulphates + data_whitewine_reduced_labelled$residual.sugar + data_whitewine_reduced_labelled$free.sulfur.dioxide,
family = binomial(link="logit"))
summary(white_lr_new_fs6_1)$coef
## Estimate Std. Error
## (Intercept) 2.37151357 0.37814774
## data_whitewine_reduced_labelled$chlorides -56.77969829 3.93201930
## data_whitewine_reduced_labelled$volatile.acidity -1.80382694 0.40010766
## data_whitewine_reduced_labelled$fixed.acidity -0.16429642 0.04514789
## data_whitewine_reduced_labelled$sulphates 1.16197722 0.30304980
## data_whitewine_reduced_labelled$residual.sugar -0.02833074 0.00817925
## data_whitewine_reduced_labelled$citric.acid -0.26153096 0.34674608
## z value Pr(>|z|)
## (Intercept) 6.2713942 3.578294e-10
## data_whitewine_reduced_labelled$chlorides -14.4403407 2.884130e-47
## data_whitewine_reduced_labelled$volatile.acidity -4.5083539 6.533253e-06
## data_whitewine_reduced_labelled$fixed.acidity -3.6390722 2.736220e-04
## data_whitewine_reduced_labelled$sulphates 3.8342782 1.259336e-04
## data_whitewine_reduced_labelled$residual.sugar -3.4637333 5.327345e-04
## data_whitewine_reduced_labelled$citric.acid -0.7542435 4.507031e-01
summary(white_lr_new_fs6_2)$coef
## Estimate Std. Error
## (Intercept) 2.230803929 0.384252218
## data_whitewine_reduced_labelled$chlorides -57.401687957 3.969023247
## data_whitewine_reduced_labelled$volatile.acidity -1.679592812 0.396111441
## data_whitewine_reduced_labelled$fixed.acidity -0.168725171 0.043345141
## data_whitewine_reduced_labelled$sulphates 1.111044424 0.302807676
## data_whitewine_reduced_labelled$residual.sugar -0.032725962 0.008499171
## data_whitewine_reduced_labelled$free.sulfur.dioxide 0.003639028 0.002247906
## z value Pr(>|z|)
## (Intercept) 5.805572 6.414662e-09
## data_whitewine_reduced_labelled$chlorides -14.462422 2.093021e-47
## data_whitewine_reduced_labelled$volatile.acidity -4.240203 2.233180e-05
## data_whitewine_reduced_labelled$fixed.acidity -3.892597 9.917677e-05
## data_whitewine_reduced_labelled$sulphates 3.669142 2.433657e-04
## data_whitewine_reduced_labelled$residual.sugar -3.850489 1.178824e-04
## data_whitewine_reduced_labelled$free.sulfur.dioxide 1.618852 1.054791e-01
None of the variables have a p-value lower than 0.001, so we stop here.
Significant variables: - chlorides - volatile.acidity - fixed.acidity - sulphates - residual.sugar
For comparison, we also provide the classifiers trained with the full dataset (which may have correlated variables as we have explained above).
White Wine
### White Wine NB Analysis
nb_model_white <- naiveBayes(as.factor(label) ~ . - label_binary , traindata_white_full, laplace=.01)
nb_prediction_white <- predict(nb_model_white,
# remove column "label"
testdata_white_full[,!(names(data_white_wine_labelled) %in% c('label', 'label_binary'))],
type='raw')
nb_score_white <- nb_prediction_white[, c("good")]
actual_class_white <- testdata_white_full$label == 'good'
nb_pred_white <- prediction(nb_score_white, actual_class_white)
nb_perf_white <- performance(nb_pred_white, "tpr", "fpr")
plot(nb_perf_white, col="blue", lwd=2, xlab="False Positive Rate (FPR)", ylab="True Positive Rate (TPR)")
abline(a=0, b=1, col="gray50", lty=3)
nb_auc_white <- performance(nb_pred_white, "auc")
nb_auc_white <- unlist(slot(nb_auc_white, "y.values"))
nb_auc_white
## [1] 0.8049925
Red Wine
nb_model_red <- naiveBayes(as.factor(label) ~ . - label_binary, traindata_red_full, laplace=.01)
nb_prediction_red <- predict(nb_model_red,
# remove column "label"
testdata_red_full[,!(names(data_red_wine_labelled) %in% c('label','label_binary'))],
type='raw')
nb_score_red <- nb_prediction_red[, c("good")]
actual_class_red <- testdata_red_full$label == 'good'
nb_pred_red <- prediction(nb_score_red, actual_class_red)
nb_perf_red <- performance(nb_pred_red, "tpr", "fpr")
plot(nb_perf_red, lwd=2, xlab="False Positive Rate (FPR)", ylab="True Positive Rate (TPR)")
abline(a=0, b=1, col="gray50", lty=3)
nb_auc_red <- performance(nb_pred_red, "auc")
nb_auc_red <- unlist(slot(nb_auc_red, "y.values"))
nb_auc_red
## [1] 0.8303
Side-by-side Plot
par(mfrow=c(1,2))
plot(nb_perf_red, col="red", lwd=2, main="Red Wine Quality Label NB Model Prediction Perf", xlab="False Positive Rate (FPR)", ylab="True Positive Rate (TPR)")
abline(a=0, b=1, col="gray50", lty=3)
plot(nb_perf_white, col="blue", lwd=2, main="White Wine Quality Label NB Model Prediction Perf", xlab="False Positive Rate (FPR)", ylab="True Positive Rate (TPR)")
abline(a=0, b=1, col="gray50", lty=3)
Multiple lines in single plot
par(mfrow=c(1,1))
plot(nb_perf_red, col="red", lwd=2, main="Wine Quality Label NB Model Prediction Perf (Full Dataset)", xlab="False Positive Rate (FPR)", ylab="True Positive Rate (TPR)")
abline(a=0, b=1, col="gray50", lty=3)
# par(new = TRUE) # We draw on top of the existing plot.
# plot(perf, col = "red", lwd=2, main="", xlab="", ylab="")
par(new = TRUE) # We draw on top of the existing plot.
plot(nb_perf_white, col = "blue", lwd=2, main="", xlab="", ylab="")
legend("topleft",
legend = c("Red Wine", "White Wine"),
col = c("red", "blue"),
pch = c(0, 0, 0))
AUCs (Full Dataset)
print(paste0("AUC of NB Model (Red Wine Full Dataset):", nb_auc_red))
## [1] "AUC of NB Model (Red Wine Full Dataset):0.830299966295919"
print(paste0("AUC of NB Model (White Wine Full Dataset):", nb_auc_white))
## [1] "AUC of NB Model (White Wine Full Dataset):0.804992497109541"
Red Wine (Full Dataset)
red_lr_train <- glm(formula = label_binary ~
fixed.acidity + volatile.acidity + citric.acid
+ residual.sugar + chlorides + free.sulfur.dioxide
+ total.sulfur.dioxide + density + pH
+ sulphates + alcohol, data = traindata_red_full,
family = binomial(link="logit"))
summary(red_lr_train)
##
## Call:
## glm(formula = label_binary ~ fixed.acidity + volatile.acidity +
## citric.acid + residual.sugar + chlorides + free.sulfur.dioxide +
## total.sulfur.dioxide + density + pH + sulphates + alcohol,
## family = binomial(link = "logit"), data = traindata_red_full)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0571 -0.4216 -0.2163 -0.1170 2.8312
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.939e+02 1.214e+02 2.422 0.01544 *
## fixed.acidity 2.644e-01 1.417e-01 1.866 0.06209 .
## volatile.acidity -2.558e+00 8.966e-01 -2.853 0.00434 **
## citric.acid 6.942e-01 9.381e-01 0.740 0.45925
## residual.sugar 2.475e-01 7.944e-02 3.116 0.00184 **
## chlorides -8.246e+00 3.434e+00 -2.401 0.01634 *
## free.sulfur.dioxide 1.566e-02 1.379e-02 1.136 0.25615
## total.sulfur.dioxide -1.869e-02 5.937e-03 -3.148 0.00164 **
## density -3.059e+02 1.240e+02 -2.467 0.01364 *
## pH -7.443e-01 1.173e+00 -0.634 0.52580
## sulphates 3.790e+00 6.362e-01 5.958 2.55e-09 ***
## alcohol 7.438e-01 1.457e-01 5.105 3.31e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1017.65 on 1279 degrees of freedom
## Residual deviance: 687.74 on 1268 degrees of freedom
## AIC: 711.74
##
## Number of Fisher Scoring iterations: 6
pred = predict.glm(red_lr_train,testdata_red_full, type="response")
predObj <- prediction(pred, testdata_red_full$label_binary)
rocObj = performance(predObj, measure="tpr", x.measure="fpr")
aucObj = performance(predObj, measure="auc")
plot(rocObj, main = paste("Area under the curve:",
round(aucObj@y.values[[1]] ,4)))
lr_auc_red <- round(aucObj@y.values[[1]] ,4)
White Wine (Full Dataset)
white_lr_train <- glm(formula = label_binary ~
fixed.acidity + volatile.acidity + citric.acid
+ residual.sugar + chlorides + free.sulfur.dioxide
+ total.sulfur.dioxide + density + pH
+ sulphates + alcohol, data = traindata_white_full,
family = binomial(link="logit"))
summary(white_lr_train)
##
## Call:
## glm(formula = label_binary ~ fixed.acidity + volatile.acidity +
## citric.acid + residual.sugar + chlorides + free.sulfur.dioxide +
## total.sulfur.dioxide + density + pH + sulphates + alcohol,
## family = binomial(link = "logit"), data = traindata_white_full)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2477 -0.6742 -0.4220 -0.1834 2.7858
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.264e+02 1.044e+02 6.001 1.96e-09 ***
## fixed.acidity 5.926e-01 1.004e-01 5.902 3.60e-09 ***
## volatile.acidity -3.749e+00 5.369e-01 -6.983 2.89e-12 ***
## citric.acid -7.900e-01 4.441e-01 -1.779 0.0752 .
## residual.sugar 2.984e-01 3.951e-02 7.553 4.27e-14 ***
## chlorides -1.087e+01 4.144e+00 -2.623 0.0087 **
## free.sulfur.dioxide 8.196e-03 3.465e-03 2.365 0.0180 *
## total.sulfur.dioxide -3.236e-04 1.681e-03 -0.192 0.8474
## density -6.499e+02 1.058e+02 -6.143 8.12e-10 ***
## pH 3.488e+00 4.747e-01 7.347 2.02e-13 ***
## sulphates 1.909e+00 3.857e-01 4.949 7.45e-07 ***
## alcohol 1.433e-01 1.263e-01 1.135 0.2565
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4093.7 on 3918 degrees of freedom
## Residual deviance: 3348.4 on 3907 degrees of freedom
## AIC: 3372.4
##
## Number of Fisher Scoring iterations: 6
pred = predict.glm(white_lr_train, testdata_white_full, type="response")
predObj <- prediction(pred, testdata_white_full$label_binary)
rocObj = performance(predObj, measure="tpr", x.measure="fpr")
aucObj = performance(predObj, measure="auc")
plot(rocObj, main = paste("Area under the curve:",
round(aucObj@y.values[[1]] ,4)))
lr_auc_white <- round(aucObj@y.values[[1]] ,4)
Red Wine (Full Dataset)
# Red wine (full dataset)
dt_fit_red_full <- rpart(label ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data=traindata_red_full,control=rpart.control(minsplit=128), method="class",
parms=list(split='information'))
summary(dt_fit_red_full)
## Call:
## rpart(formula = label ~ fixed.acidity + volatile.acidity + citric.acid +
## residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
## density + pH + sulphates + alcohol, data = traindata_red_full,
## method = "class", parms = list(split = "information"), control = rpart.control(minsplit = 128))
## n= 1280
##
## CP nsplit rel error xerror xstd
## 1 0.03256705 0 1.0000000 1.0000000 0.07046897
## 2 0.01000000 3 0.9022989 0.9655172 0.06943090
##
## Variable importance
## alcohol volatile.acidity density
## 36 15 13
## sulphates citric.acid chlorides
## 10 10 8
## fixed.acidity pH total.sulfur.dioxide
## 3 3 1
## free.sulfur.dioxide
## 1
##
## Node number 1: 1280 observations, complexity param=0.03256705
## predicted class=bad expected loss=0.1359375 P(node) =1
## class counts: 1106 174
## probabilities: 0.864 0.136
## left son=2 (739 obs) right son=3 (541 obs)
## Primary splits:
## alcohol < 10.45 to the left, improve=83.58557, (0 missing)
## volatile.acidity < 0.385 to the right, improve=50.87817, (0 missing)
## sulphates < 0.685 to the left, improve=47.38478, (0 missing)
## citric.acid < 0.315 to the left, improve=46.52559, (0 missing)
## density < 0.99536 to the right, improve=30.14998, (0 missing)
## Surrogate splits:
## density < 0.995745 to the right, agree=0.732, adj=0.366, (0 split)
## chlorides < 0.0685 to the right, agree=0.662, adj=0.201, (0 split)
## volatile.acidity < 0.385 to the right, agree=0.637, adj=0.140, (0 split)
## sulphates < 0.675 to the left, agree=0.636, adj=0.139, (0 split)
## citric.acid < 0.315 to the left, agree=0.624, adj=0.111, (0 split)
##
## Node number 2: 739 observations
## predicted class=bad expected loss=0.03247632 P(node) =0.5773438
## class counts: 715 24
## probabilities: 0.968 0.032
##
## Node number 3: 541 observations, complexity param=0.03256705
## predicted class=bad expected loss=0.2772643 P(node) =0.4226563
## class counts: 391 150
## probabilities: 0.723 0.277
## left son=6 (306 obs) right son=7 (235 obs)
## Primary splits:
## volatile.acidity < 0.425 to the right, improve=24.20970, (0 missing)
## sulphates < 0.735 to the left, improve=22.79706, (0 missing)
## citric.acid < 0.275 to the left, improve=21.83648, (0 missing)
## alcohol < 11.55 to the left, improve=20.39001, (0 missing)
## pH < 3.355 to the right, improve=16.80804, (0 missing)
## Surrogate splits:
## citric.acid < 0.315 to the left, agree=0.808, adj=0.557, (0 split)
## fixed.acidity < 8.15 to the left, agree=0.693, adj=0.294, (0 split)
## pH < 3.305 to the right, agree=0.688, adj=0.281, (0 split)
## sulphates < 0.745 to the left, agree=0.632, adj=0.153, (0 split)
## total.sulfur.dioxide < 10.5 to the right, agree=0.603, adj=0.085, (0 split)
##
## Node number 6: 306 observations
## predicted class=bad expected loss=0.1601307 P(node) =0.2390625
## class counts: 257 49
## probabilities: 0.840 0.160
##
## Node number 7: 235 observations, complexity param=0.03256705
## predicted class=bad expected loss=0.4297872 P(node) =0.1835938
## class counts: 134 101
## probabilities: 0.570 0.430
## left son=14 (130 obs) right son=15 (105 obs)
## Primary splits:
## sulphates < 0.735 to the left, improve=8.925273, (0 missing)
## total.sulfur.dioxide < 50.5 to the right, improve=6.021858, (0 missing)
## pH < 3.385 to the right, improve=5.604761, (0 missing)
## density < 0.99538 to the right, improve=4.786024, (0 missing)
## volatile.acidity < 0.335 to the right, improve=4.260359, (0 missing)
## Surrogate splits:
## free.sulfur.dioxide < 20.5 to the left, agree=0.626, adj=0.162, (0 split)
## total.sulfur.dioxide < 43.5 to the left, agree=0.617, adj=0.143, (0 split)
## chlorides < 0.0785 to the right, agree=0.604, adj=0.114, (0 split)
## alcohol < 11.05 to the right, agree=0.600, adj=0.105, (0 split)
## density < 0.99654 to the left, agree=0.583, adj=0.067, (0 split)
##
## Node number 14: 130 observations
## predicted class=bad expected loss=0.3076923 P(node) =0.1015625
## class counts: 90 40
## probabilities: 0.692 0.308
##
## Node number 15: 105 observations
## predicted class=good expected loss=0.4190476 P(node) =0.08203125
## class counts: 44 61
## probabilities: 0.419 0.581
rpart.plot(dt_fit_red_full, type=4, extra=1)
dt_prediction_red_full <- predict(dt_fit_red_full,
testdata_red_full[, !(names(data_red_wine_labelled) %in% c('label', 'label_binary'))],
type='prob')
dt_pred_red_full <- prediction(dt_prediction_red_full[, 2], testdata_red_full$label)
dt_perf_red_full <- performance(dt_pred_red_full, "tpr", "fpr")
plot(dt_perf_red_full, col="red", lwd=2, xlab="False Positive Rate (FPR)", ylab="True Positive Rate (TPR)")
abline(a=0, b=1, col="gray50", lty=3)
dt_auc_red_full <- performance(dt_pred_red_full, "auc")
dt_auc_red_full <- unlist(slot(dt_auc_red_full,"y.values"))
dt_auc_red_full
## [1] 0.8300472
White Wine (Full Dataset)
# White wine (full dataset)
dt_fit_white_full <- rpart(label ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data=traindata_white_full,control=rpart.control(minsplit=390), method="class",
parms=list(split='information'))
summary(dt_fit_white_full )
## Call:
## rpart(formula = label ~ fixed.acidity + volatile.acidity + citric.acid +
## residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
## density + pH + sulphates + alcohol, data = traindata_white_full,
## method = "class", parms = list(split = "information"), control = rpart.control(minsplit = 390))
## n= 3919
##
## CP nsplit rel error xerror xstd
## 1 0.02515723 0 1.0000000 1.0000000 0.03039865
## 2 0.01179245 3 0.9245283 0.9316038 0.02961644
## 3 0.01000000 4 0.9127358 0.9540094 0.02987934
##
## Variable importance
## alcohol density chlorides
## 41 24 14
## total.sulfur.dioxide residual.sugar sulphates
## 9 7 2
## volatile.acidity free.sulfur.dioxide
## 1 1
##
## Node number 1: 3919 observations, complexity param=0.02515723
## predicted class=bad expected loss=0.2163817 P(node) =1
## class counts: 3071 848
## probabilities: 0.784 0.216
## left son=2 (2297 obs) right son=3 (1622 obs)
## Primary splits:
## alcohol < 10.625 to the left, improve=233.72620, (0 missing)
## density < 0.992215 to the right, improve=166.06470, (0 missing)
## chlorides < 0.0395 to the right, improve=127.94500, (0 missing)
## total.sulfur.dioxide < 171.5 to the right, improve= 71.21220, (0 missing)
## residual.sugar < 6.35 to the right, improve= 36.16577, (0 missing)
## Surrogate splits:
## density < 0.992545 to the right, agree=0.851, adj=0.641, (0 split)
## chlorides < 0.0375 to the right, agree=0.745, adj=0.383, (0 split)
## total.sulfur.dioxide < 132.5 to the right, agree=0.692, adj=0.256, (0 split)
## residual.sugar < 5.275 to the right, agree=0.676, adj=0.218, (0 split)
## sulphates < 0.375 to the right, agree=0.609, adj=0.055, (0 split)
##
## Node number 2: 2297 observations
## predicted class=bad expected loss=0.09708315 P(node) =0.5861189
## class counts: 2074 223
## probabilities: 0.903 0.097
##
## Node number 3: 1622 observations, complexity param=0.02515723
## predicted class=bad expected loss=0.3853268 P(node) =0.4138811
## class counts: 997 625
## probabilities: 0.615 0.385
## left son=6 (943 obs) right son=7 (679 obs)
## Primary splits:
## alcohol < 11.875 to the left, improve=44.62699, (0 missing)
## chlorides < 0.0495 to the right, improve=20.65826, (0 missing)
## free.sulfur.dioxide < 13.5 to the left, improve=18.62282, (0 missing)
## pH < 3.295 to the left, improve=17.30513, (0 missing)
## density < 0.991355 to the right, improve=17.01734, (0 missing)
## Surrogate splits:
## density < 0.990315 to the right, agree=0.759, adj=0.424, (0 split)
## volatile.acidity < 0.3175 to the left, agree=0.654, adj=0.172, (0 split)
## chlorides < 0.0355 to the right, agree=0.633, adj=0.124, (0 split)
## sulphates < 0.765 to the left, agree=0.592, adj=0.025, (0 split)
## fixed.acidity < 5.25 to the right, agree=0.589, adj=0.018, (0 split)
##
## Node number 6: 943 observations
## predicted class=bad expected loss=0.2884411 P(node) =0.2406226
## class counts: 671 272
## probabilities: 0.712 0.288
##
## Node number 7: 679 observations, complexity param=0.02515723
## predicted class=good expected loss=0.4801178 P(node) =0.1732585
## class counts: 326 353
## probabilities: 0.480 0.520
## left son=14 (491 obs) right son=15 (188 obs)
## Primary splits:
## alcohol < 12.775 to the left, improve=11.972890, (0 missing)
## free.sulfur.dioxide < 21.5 to the left, improve=10.241170, (0 missing)
## residual.sugar < 1.55 to the left, improve= 8.882023, (0 missing)
## volatile.acidity < 0.275 to the left, improve= 5.875972, (0 missing)
## pH < 3.295 to the left, improve= 5.477827, (0 missing)
## Surrogate splits:
## density < 0.989135 to the right, agree=0.775, adj=0.186, (0 split)
## volatile.acidity < 0.485 to the left, agree=0.741, adj=0.064, (0 split)
## sulphates < 0.275 to the right, agree=0.732, adj=0.032, (0 split)
## fixed.acidity < 5.05 to the right, agree=0.729, adj=0.021, (0 split)
## chlorides < 0.1175 to the left, agree=0.726, adj=0.011, (0 split)
##
## Node number 14: 491 observations, complexity param=0.01179245
## predicted class=bad expected loss=0.4623218 P(node) =0.1252871
## class counts: 264 227
## probabilities: 0.538 0.462
## left son=28 (139 obs) right son=29 (352 obs)
## Primary splits:
## free.sulfur.dioxide < 23.5 to the left, improve=6.851487, (0 missing)
## fixed.acidity < 6.35 to the left, improve=6.707328, (0 missing)
## residual.sugar < 1.775 to the left, improve=5.417168, (0 missing)
## volatile.acidity < 0.3075 to the left, improve=4.794428, (0 missing)
## total.sulfur.dioxide < 117.5 to the left, improve=4.381399, (0 missing)
## Surrogate splits:
## total.sulfur.dioxide < 80.5 to the left, agree=0.794, adj=0.273, (0 split)
## volatile.acidity < 0.1575 to the left, agree=0.731, adj=0.050, (0 split)
## chlorides < 0.071 to the right, agree=0.721, adj=0.014, (0 split)
##
## Node number 15: 188 observations
## predicted class=good expected loss=0.3297872 P(node) =0.04797142
## class counts: 62 126
## probabilities: 0.330 0.670
##
## Node number 28: 139 observations
## predicted class=bad expected loss=0.3309353 P(node) =0.03546823
## class counts: 93 46
## probabilities: 0.669 0.331
##
## Node number 29: 352 observations
## predicted class=good expected loss=0.4857955 P(node) =0.08981883
## class counts: 171 181
## probabilities: 0.486 0.514
rpart.plot(dt_fit_white_full , type=4, extra=1)
dt_prediction_white_full <- predict(dt_fit_white_full,
testdata_white_full[, !(names(data_white_wine_labelled) %in% c('label', 'label_binary'))],
type='prob')
dt_pred_white_full <- prediction(dt_prediction_white_full[, 2], testdata_white_full$label)
dt_perf_white_full <- performance(dt_pred_white_full, "tpr", "fpr")
plot(dt_perf_white_full, col="blue", lwd=2, xlab="False Positive Rate (FPR)", ylab="True Positive Rate (TPR)")
abline(a=0, b=1, col="gray50", lty=3)
dt_auc_white_full <- performance(dt_pred_white_full, "auc")
dt_auc_white_full <- unlist(slot(dt_auc_white_full,"y.values"))
dt_auc_white_full
## [1] 0.7795565